[PATCH RFC v3 tip/core/rcu 0/34] Prototype RCU usable from idle, exception, offline

From: Paul E. McKenney
Date: Fri Mar 27 2020 - 18:23:50 EST


Hello!

This series provides two variants of Tasks RCU, a rude variant inspired
by Steven Rostedt's use of schedule_on_each_cpu(), and a tracing variant
requested by the BPF folks to be used (for example) to protect BPF
programs that unconditionally access userspace memory, and thus might
occasionally take a page fault, resulting in a voluntary context switch.

The rude variant uses context switches and offline as its quiescent
states, so that preempt-disabled regions of code executing on online
CPUs form the tasks rude RCU readers.

The tracing variant has explicit read-side markers to permit finite grace
periods even given in-kernel loops in PREEMPT=n builds. These markers
are rcu_read_lock_trace() and rcu_read_unlock_trace(), so that any code
not under rcu_read_lock_trace() is a quiescent state. This variant
also protects marked code in the idle loop, on exception entry/exit
paths, and on the various CPU-hotplug online/offline code paths, thus
having protection properties similar to SRCU. However, unlike SRCU,
this variant avoids expensive instructions in the read-side primitives,
thus having read-side overhead similar to that of preemptible RCU.
This difference is important for some BPF programs, according to
benchmarking from Alexei Starovoitov:

https://lore.kernel.org/lkml/20200310014043.4dbagqbr2wsbuarm@ast-mbp/

There are of course downsides. The grace-period code can send IPIs to
CPUs, even when those CPUs are in the idle loop or in nohz_full userspace.
However, this version enlists the aid of the context-switch hooks,
which eliminates the need for IPIs in context-switch-heavy workloads.
It also prohibits sending of IPIs early in the grace period based on a
new rcupdate.rcu_task_ipi_delay kernel boot parameter, which provides
additional opportunity for the hooks to do their job. Finally, a new
TASKS_TRACE_RCU_READ_MB Kconfig option avoids sending IPIs to tasks
executing userspace or in the idle loop, at the expense of higher overhead
readers during kernel entry/exit code and in the idle loop.

It is also necessary to scan the full tasklist, much as for Tasks RCU.
There is a single callback queue guarded by a single lock, again, much
as for Tasks RCU. If needed, these downsides can be at least partially
remedied.

Perhaps most important, this variant of RCU does not affect the vanilla
flavors, rcu_preempt and rcu_sched. The fact that RCU Tasks Trace
readers can operate from idle, offline, and exception entry/exit in no
way allows rcu_preempt and rcu_sched readers to also do so.

The RCU tasks trace mechanism is based off of RCU tasks rather than
SRCU because the latter is more complex and also because the latter
uses a CPU-by-CPU approach to tracking quiescent states instead of the
task-by-task approach that is needed. It is in theory possible to mash
RCU tasks trace into the Tree SRCU implementation, but there will need
to be extremely good reasons for doing so. The vanilla RCU mechanism
could in theory be used in CONFIG_PREEMPT=y kernels, but fails utterly
in CONFIG_PREEMPT=n kernels. Tasks RCU does not work because page
faults can result in a voluntary context switch, which prevents it from
protecting a BPF program that page faults. The new "rude" variant only
protected preempt-disable regions of code, thus also failing to protect
BPF programs that page fault.

This effort benefited greatly from off-list discussions of BPF
requirements with Alexei Starovoitov and Andrii Nakryiko, as well as from
numerous on-list discussions, at least some of which are captured in the
"Link:" tags on the patches themselves.

The patches in this series are as follows, with asterisks indicating
significant change from v1:

1*. Add function to sample state of a locked-down task. Added
the task_struct argument to the callback function.

2*. Use the above function to add per-task state to RCU CPU stall
warnings. This commit was adapted to the updated API.

3. Add rcutorture module parameter to produce non-busy-wait task
stalls, thus allowing the above RCU CPU stall change to be
exercised.

4. Move Tasks RCU to its own file.

5. Create struct to hold RCU-tasks state information.

6. Reinstate synchronize_rcu_mult(), as there will likely once
again be a need to wait on multiple flavors of RCU.

7. Add an rcutorture test for synchronize_rcu_mult().

8. Refactor RCU-tasks to allow variants to be added.

9*. Add an RCU-tasks rude variant, based on Steven Rostedt's
use of schedule_on_each_cpu(). Updated Kconfig default
to rely on default default value, updated help text, and
updated the header comment.

10. Add torture tests for RCU Tasks Rude.

11. Use unique names for RCU-Tasks kthreads and messages.

12. Further refactor RCU-tasks to allow adding even more variants.

13. Code movement to allow even more Tasks RCU variants.

14*. Add an RCU Tasks Trace to simplify protection of tracing hooks,
including BPF. This version fixes even more bugs and adds a
URL to an email explaining the memory ordering. It also updates
the Kconfig default and updates the help text. Furthermore, it
moves a misplaced comment update. Finally, it makes the
rcu_read_unlock_trace() function safe for scheduler locks,
interrupt handlers, and NMI handlers.

15. Add torture tests for RCU Tasks Trace.

16. Add stall warnings for RCU Tasks Trace.

17. Move #ifdef into tasks.h to ease addition of Kconfig-dependent APIs.

18. Add RCU-tasks-specific information to rcutorture writer stall
output, easing debugging of these RCU variants.

19. Make the above rcutorture writer stall output include
grace-period state.

20. Cause RCU tasks trace to take advantage of RCU scheduler hooks,
thus reducing the number of IPIs.

21. Record grace-period start time for RCU tasks variants for
IPI throttling and for debugging.

22. Provide a kernel boot parameter to delay IPIs until a given grace
period reaches the specified age, with this age defaulting to
half a second, further reducing the number of IPIs. To zero on
context-switch-heavy workloads.

23*. Split ->trc_reader_need_end to make room for memory-barrier
indication.

24*. Add grace-period and IPI counts to statistics.

25*. Add Kconfig option to mediate smp_mb() vs. IPI.

26*. Avoid IPIing userspace/idle tasks if kernel is so built.

27*. Allow rcu_read_unlock_trace() under scheduler locks.

28*. Disable CPU hotplug across RCU tasks trace scans. This enables
detection of idle tasks for offline CPUs.

29*. Handle the running-offline idle-task special case.

30*. Make RCU tasks trace also wait for idle tasks.

31*. Add rcu_dynticks_zero_in_eqs() effectiveness statistics.

32*. Add count for idle tasks on offline CPUs.

33*. Add TRACE02 scenario enabling RCU Tasks Trace IPIs.
The existing TRACE01 scenario avoids IPIs to userspace
and idle CPUs.

34*. Add IPI failure count to statistics.

These new versions of Tasks RCU now pass heavy rcutorture testing, and
should thus be fine for experimental use. The original Tree RCU went
upstream with less testing than this has seen, but then again those were
simpler times. ;-)

Changes since v2:

o Leveraged idle entry/exit hooks to reduce IPIing of idle and
userspace tasks.

o Switch to read-side memory barriers during idle and userspace
execution in kernels built for real-time or battery-powered use,
mediated by a new TASKS_TRACE_RCU_READ_MB Kconfig option. Also
add an rcutorture test scenario for this option.

o Adjust rcutorture to better test the IPI path. (Seeing zero IPIs
might be satisfying to me personally, but it is a lousy test
strategy!)

o Added more information to stall warnings and rcutorture
end-of-test printout.

o Make rcu_read_unlock_trace() usable when invoked with
scheduler locks held.

o Make rcu_read_unlock_trace() usable in interrupt and NMI
handlers.

o Fix handling of idle tasks, including those "running" on
offline CPUs.

o Fixed a number of other bugs found during testing and responded
to review feedback.

Changes since v1:

o Updated this cover letter to provide more detail, including
on roads not taken.

o Updated commit logs based on feedback from v1.

o Updated the function providing a consistent view of the
specified non-running task's state to invoke the specified
function even if the task is currently running. This will
be necessary to safely eliminate IPIs for long-term idle and
userspace execution. The function may also now return false
to transmit a failure indication to the caller, for example,
if the function cannot handle being invoked on a running CPU.
The function is now passed the relevant task_struct pointer as
well as the specified argument.

Changes were of course made to use the new API.

o Leveraged context-switch hooks to avoid unnecessary IPIs.

o Held off IPIs for the first half second (by default) of each
grace period to give the context-switch hooks a better chance
to do their job.

o Lots of testing.

o Fixed a number of bugs and responded to v2 feedback.

Todo:

o Even more testing.

o If all goes well, post a non-RFC series.

Thanx, Paul

------------------------------------------------------------------------

Documentation/admin-guide/kernel-parameters.txt | 12
include/linux/rcupdate.h | 48
include/linux/rcupdate_trace.h | 98
include/linux/rcupdate_wait.h | 19
include/linux/rcutiny.h | 2
include/linux/sched.h | 12
include/linux/wait.h | 2
init/init_task.c | 5
kernel/fork.c | 5
kernel/rcu/Kconfig | 50
kernel/rcu/Kconfig.debug | 4
kernel/rcu/rcu.h | 5
kernel/rcu/rcutorture.c | 99
kernel/rcu/tasks.h | 2089 +++++++++---
kernel/rcu/tree.c | 24
kernel/rcu/tree.h | 2
kernel/rcu/tree_plugin.h | 24
kernel/rcu/tree_stall.h | 40
kernel/rcu/update.c | 375 --
kernel/sched/core.c | 48
tools/testing/selftests/rcutorture/configs/rcu/CFLIST | 3
tools/testing/selftests/rcutorture/configs/rcu/RUDE01 | 10
tools/testing/selftests/rcutorture/configs/rcu/RUDE01.boot | 1
tools/testing/selftests/rcutorture/configs/rcu/TRACE01 | 11
tools/testing/selftests/rcutorture/configs/rcu/TRACE01.boot | 1
tools/testing/selftests/rcutorture/configs/rcu/TRACE02 | 11
tools/testing/selftests/rcutorture/configs/rcu/TRACE02.boot | 1
27 files changed, 2140 insertions(+), 861 deletions(-)