Re: [PATCH rcu 3/3] rcu: Allow expedited RCU grace periods on incoming CPUs

From: Mukesh Ojha
Date: Thu Feb 17 2022 - 11:14:03 EST



On 2/15/2022 11:09 PM, Paul E. McKenney wrote:
On Tue, Feb 15, 2022 at 07:53:10PM +0530, Mukesh Ojha wrote:
On 2/14/2022 10:14 PM, Paul E. McKenney wrote:
On Thu, Feb 10, 2022 at 12:38:11AM +0100, Frederic Weisbecker wrote:
On Fri, Feb 04, 2022 at 02:55:07PM -0800, Paul E. McKenney wrote:
Although it is usually safe to invoke synchronize_rcu_expedited() from a
preemption-enabled CPU-hotplug notifier, if it is invoked from a notifier
between CPUHP_AP_RCUTREE_ONLINE and CPUHP_AP_ACTIVE, its attempts to
invoke a workqueue handler will hang due to RCU waiting on a CPU that
the scheduler is not paying attention to. This commit therefore expands
use of the existing workqueue-independent synchronize_rcu_expedited()
from early boot to also include CPUs that are being hotplugged.

Link: https://lore.kernel.org/lkml/7359f994-8aaf-3cea-f5cf-c0d3929689d6@xxxxxxxxxxx/
Reported-by: Mukesh Ojha <quic_mojha@xxxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

I'm surprised by this scheduler behaviour.

Since sched_cpu_activate() hasn't been called yet,
rq->balance_callback = balance_push_callback. As a result, balance_push() should
be called at the end of schedule() when the workqueue is picked as the next task.
Then eventually the workqueue should be immediately preempted by the stop task to
be migrated elsewhere.

So I must be missing something. For the fun, I booted the following and it
didn't produce any issue:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 80faf2273ce9..b1e74a508881 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4234,6 +4234,8 @@ int rcutree_online_cpu(unsigned int cpu)
 	// Stop-machine done, so allow nohz_full to disable tick.
 	tick_dep_clear(TICK_DEP_BIT_RCU);
+	if (cpu != 0)
+		synchronize_rcu_expedited();
 	return 0;
 }

That does seem compelling. And others have argued that the workqueue
system's handling of offline CPUs should deal with this.

Mukesh, was this a theoretical bug, or did you actually make it happen?
If you made it happen, as seems to have been the case given your original
email [1], could you please post your reproducer?

No, it was not a theoretical one. We saw this issue only once in our
testing, and I don't think it is easy to reproduce; otherwise it would
have been fixed by now.

One thread called synchronize_rcu_expedited() with a 20s timer, but it
could not get the exp funnel lock for 20s, and at that point we crashed
it with panic() on timeout.

The other thread, cpuhp, which was holding the lock, got stuck at the
point mentioned at the link below.

https://lore.kernel.org/lkml/7359f994-8aaf-3cea-f5cf-c0d3929689d6@xxxxxxxxxxx/

OK. Are you able to create an in-kernel reproducer, perhaps similar to
Frederic's change above?

I am worried that the patch that I am carrying might be fixing some
other bug by accident...

I have started an overnight test to reproduce this. Let me see if we hit it.
If not, feel free to make the decision on this patch.

Thanks,
-Mukesh


Thanx, Paul

E.g., the sample test below, run in parallel with many other tests:

:loop
adb shell "echo 0 > /sys/devices/system/cpu/cpu0/online"
adb shell "echo 1 > /sys/devices/system/cpu/cpu0/online"
adb shell "echo 0 > /sys/devices/system/cpu/cpu1/online"
adb shell "echo 1 > /sys/devices/system/cpu/cpu1/online"
adb shell "echo 0 > /sys/devices/system/cpu/cpu2/online"
adb shell "echo 1 > /sys/devices/system/cpu/cpu2/online"
adb shell "echo 0 > /sys/devices/system/cpu/cpu3/online"
adb shell "echo 1 > /sys/devices/system/cpu/cpu3/online"
adb shell "echo 0 > /sys/devices/system/cpu/cpu4/online"
adb shell "echo 1 > /sys/devices/system/cpu/cpu4/online"
adb shell "echo 0 > /sys/devices/system/cpu/cpu5/online"
adb shell "echo 1 > /sys/devices/system/cpu/cpu5/online"
adb shell "echo 0 > /sys/devices/system/cpu/cpu6/online"
adb shell "echo 1 > /sys/devices/system/cpu/cpu6/online"
adb shell "echo 0 > /sys/devices/system/cpu/cpu7/online"
adb shell "echo 1 > /sys/devices/system/cpu/cpu7/online"
goto loop
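
For setups without a host-side adb loop, the same offline/online cycle could probably be run directly on the target as a POSIX shell script. This is only a sketch under assumptions: the SYSFS_CPU_ROOT, NCPUS and ITERATIONS variables and the toggle_cpu, hotplug_cycle and run_hotplug_stress function names are made up here for illustration and are not part of the original test. It only covers the hotplug part, not the other tests that were running in parallel.

```shell
#!/bin/sh
# Sketch of a CPU-hotplug stress loop mirroring the adb loop above.
# All variable and function names below are illustrative assumptions.
SYSFS_CPU_ROOT="${SYSFS_CPU_ROOT:-/sys/devices/system/cpu}"
NCPUS="${NCPUS:-8}"             # cpu0..cpu7, as in the adb loop
ITERATIONS="${ITERATIONS:-0}"   # 0 means loop forever, like "goto loop"

# Offline and then online one CPU. CPUs whose "online" file is
# missing or not writable are skipped.
toggle_cpu() {
    f="$SYSFS_CPU_ROOT/cpu$1/online"
    [ -w "$f" ] || return 0
    echo 0 > "$f"
    echo 1 > "$f"
}

# One full pass over cpu0..cpu(NCPUS-1).
hotplug_cycle() {
    cpu=0
    while [ "$cpu" -lt "$NCPUS" ]; do
        toggle_cpu "$cpu"
        cpu=$((cpu + 1))
    done
}

# Repeat the pass ITERATIONS times, or forever when ITERATIONS is 0.
run_hotplug_stress() {
    n=0
    while [ "$ITERATIONS" -eq 0 ] || [ "$n" -lt "$ITERATIONS" ]; do
        hotplug_cycle
        n=$((n + 1))
    done
}
```

The script only defines functions, so nothing is toggled until run_hotplug_stress is called explicitly (as root, on the target). Setting ITERATIONS to a nonzero value bounds the run, and pointing SYSFS_CPU_ROOT at a scratch directory allows a dry run without touching real CPUs.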



Thanks, Mukesh

Thanx, Paul

[1] https://lore.kernel.org/lkml/7359f994-8aaf-3cea-f5cf-c0d3929689d6@xxxxxxxxxxx/