Re: [PATCH RFC] sched: Make wake_up_nohz_cpu() handle CPUs going offline

From: Frederic Weisbecker
Date: Thu Jun 30 2016 - 19:30:23 EST


On Thu, Jun 30, 2016 at 10:58:45AM -0700, Paul E. McKenney wrote:
> Both timers and hrtimers are maintained on the outgoing CPU until
> CPU_DEAD time, at which point they are migrated to a surviving CPU. If a
> mod_timer() executes between CPU_DYING and CPU_DEAD time, x86 systems
> will splat in native_smp_send_reschedule() when attempting to wake up
> the just-now-offlined CPU, as shown below from a NO_HZ_FULL kernel:
>
> [ 7976.741556] WARNING: CPU: 0 PID: 661 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:125 native_smp_send_reschedule+0x39/0x40
> [ 7976.741595] Modules linked in:
> [ 7976.741595] CPU: 0 PID: 661 Comm: rcu_torture_rea Not tainted 4.7.0-rc2+ #1
> [ 7976.741595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> [ 7976.741595] 0000000000000000 ffff88000002fcc8 ffffffff8138ab2e 0000000000000000
> [ 7976.741595] 0000000000000000 ffff88000002fd08 ffffffff8105cabc 0000007d1fd0ee18
> [ 7976.741595] 0000000000000001 ffff88001fd16d40 ffff88001fd0ee00 ffff88001fd0ee00
> [ 7976.741595] Call Trace:
> [ 7976.741595] [<ffffffff8138ab2e>] dump_stack+0x67/0x99
> [ 7976.741595] [<ffffffff8105cabc>] __warn+0xcc/0xf0
> [ 7976.741595] [<ffffffff8105cb98>] warn_slowpath_null+0x18/0x20
> [ 7976.741595] [<ffffffff8103cba9>] native_smp_send_reschedule+0x39/0x40
> [ 7976.741595] [<ffffffff81089bc2>] wake_up_nohz_cpu+0x82/0x190
> [ 7976.741595] [<ffffffff810d275a>] internal_add_timer+0x7a/0x80
> [ 7976.741595] [<ffffffff810d3ee7>] mod_timer+0x187/0x2b0
> [ 7976.741595] [<ffffffff810c89dd>] rcu_torture_reader+0x33d/0x380
> [ 7976.741595] [<ffffffff810c66f0>] ? sched_torture_read_unlock+0x30/0x30
> [ 7976.741595] [<ffffffff810c86a0>] ? rcu_bh_torture_read_lock+0x80/0x80
> [ 7976.741595] [<ffffffff8108068f>] kthread+0xdf/0x100
> [ 7976.741595] [<ffffffff819dd83f>] ret_from_fork+0x1f/0x40
> [ 7976.741595] [<ffffffff810805b0>] ? kthread_create_on_node+0x200/0x200
>
> However, in this case, the wakeup is redundant, because the timer
> migration will reprogram timer hardware as needed. Note that the fact
> that preemption is disabled does not avoid the splat, as the offline
> operation has already passed both the synchronize_sched() and the
> stop_machine() that would be blocked by disabled preemption.
>
> This commit therefore modifies wake_up_nohz_cpu() to avoid attempting
> to wake up offline CPUs. It also adds a comment stating that the
> caller must tolerate lost wakeups when the target CPU is going offline,
> and suggesting the CPU_DEAD notifier as a recovery mechanism.
>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> ---
> core.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7f2cae4620c7..08502966e7df 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -590,9 +590,14 @@ static bool wake_up_full_nohz_cpu(int cpu)
> return false;
> }
>
> +/*
> + * Wake up the specified CPU. If the CPU is going offline, it is the
> + * caller's responsibility to deal with the lost wakeup, for example,
> + * by hooking into the CPU_DEAD notifier like timers and hrtimers do.
> + */
> void wake_up_nohz_cpu(int cpu)
> {
> - if (!wake_up_full_nohz_cpu(cpu))
> + if (cpu_online(cpu) && !wake_up_full_nohz_cpu(cpu))

So at this point, as we passed CPU_DYING, I believe the CPU isn't visible in the domains
anymore (correct me if I'm wrong), therefore get_nohz_timer_target() can't return it,
unless smp_processor_id() is the only alternative.

Hence, that call to wake_up_nohz_cpu() can only happen to online CPUs or the current
one (pinned). And wake_up_idle_cpu() on the current CPU is a no-op. So only
wake_up_full_nohz_cpu() is concerned. Then perhaps it would be better to move that
cpu_online() check to wake_up_full_nohz_cpu() ?

BTW, it seems that rcutorture stops its kthreads after CPU_DYING, is it expected that
it queues timers at this stage?

Thanks.

> wake_up_idle_cpu(cpu);
> }
>
>