Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

From: Steven Rostedt
Date: Wed Feb 21 2024 - 13:17:22 EST


On Mon, 19 Feb 2024 08:48:20 -0800
"Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote:

> > I will look again -- it is quite possible that I was confused by earlier
> > in-fleet setups that had Tasks RCU enabled even when preemption was
> > disabled. (We don't do that anymore, and, had I been paying sufficient
> > attention, would not have been doing it to start with. Back in the day,
> > enabling rcutorture, even as a module, had the side effect of enabling
> > Tasks RCU. How else to test it, right? Well...)
>
> OK, I got my head straight on this one...
>
> And the problem is in fact that Tasks RCU isn't normally present
> in non-preemptible kernels. This is because normal RCU will wait
> for preemption-disabled regions of code, and in PREMPT_NONE and
> PREEMPT_VOLUNTARY kernels, that includes pretty much any region of code
> lacking an explicit schedule() or similar. And as I understand it,
> tracing trampolines rely on this implicit lack of preemption.
>
> So, with lazy preemption, we could preempt in the middle of a
> trampoline, and synchronize_rcu() won't save us.
>
> Steve and Mathieu will correct me if I am wrong.
>
> If I do understand this correctly, one workaround is to remove the
> "if PREEMPTIBLE" on all occurrences of "select TASKS_RCU". That way,
> all kernels would use synchronize_rcu_tasks(), which would wait for
> a voluntary context switch.
>
> This workaround does increase the overhead and tracepoint-removal
> latency on non-preemptible kernels, so it might be time to revisit the
> synchronization of trampolines. Unfortunately, the things I have come
> up with thus far have disadvantages:
>
> o Keep a set of permanent trampolines that enter and exit
> some sort of explicit RCU read-side critical section.
> If the address for this trampoline to call is in a register,
> then these permanent trampolines remain constant so that
> no synchronization of them is required. The selected
> flavor of RCU can then be used to deal with the non-permanent
> trampolines.
>
> The disadvantage here is a significant increase in the complexity
> and overhead of trampoline code and the code that invokes the
> trampolines. This overhead limits where tracing may be used
> in the kernel, which is of course undesirable.

I wonder if we can just see if the instruction pointer at preemption is at
something that was allocated? That is, if it __is_kernel(addr) returns
false, then we need to do more work. Of course that means modules will also
trigger this. We could check __is_module_text() but that does a bit more
work and may cause too much overhead. But who knows, if the module check is
only done if the __is_kernel() check fails, maybe it's not that bad.

-- Steve

>
> o Check for being preempted within a trampoline, and track this
> within the tasks structure. The disadvantage here is that this
> requires keeping track of all of the trampolines and adding a
> check for being in one on a scheduler fast path.
>
> o Have a variant of Tasks RCU which checks the stack of preempted
> tasks, waiting until all have been seen without being preempted
> in a trampoline. This still requires keeping track of all the
> trampolines in an easy-to-search manner, but gets the overhead
> of searching off of the scheduler fastpaths.
>
> It is also necessary to check running tasks, which might have
> been interrupted from within a trampoline.
>
> I would have a hard time convincing myself that these return
> addresses were unconditionally reliable. But maybe they are?
>
> o Your idea here!
>
> Again, the short-term workaround is to remove the "if PREEMPTIBLE" from
> all of the "select TASKS_RCU" clauses.
>
> > > > My next step is to try this on bare metal on a system configured as
> > > > is the fleet. But good progress for a week!!!
> > >
> > > Yeah this is great. Fingers crossed for the wider set of tests.
> >
> > I got what might be a one-off when hitting rcutorture and KASAN harder.
> > I am running 320*TRACE01 to see if it reproduces.
>
> [ . . . ]
>
> > So, first see if it is reproducible, second enable more diagnostics,
> > third make more grace-period sequence numbers available to rcutorture,
> > fourth recheck the diagnostics code, and then see where we go from there.
> > It might be that lazy preemption needs adjustment, or it might be that
> > it just tickled latent diagnostic issues in rcutorture.
> >
> > (I rarely hit this WARN_ON() except in early development, when the
> > problem is usually glaringly obvious, hence all the uncertainty.)
>
> And it is eminently reproducible. Digging into it...