Re: [PATCH] powerpc/smp: Wait until secondaries are active & online

From: Stewart Smith
Date: Wed Feb 25 2015 - 18:13:28 EST


Michael Ellerman <mpe@xxxxxxxxxxxxxx> writes:

> Anton has a busy ppc64le KVM box where guests sometimes hit the infamous
> "kernel BUG at kernel/smpboot.c:134!" issue during boot:
>
> BUG_ON(td->cpu != smp_processor_id());
>
> Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
> output confirms it:
>
> CPU: 0
> Comm: watchdog/130
>
> The problem is that we aren't ensuring the CPU active bit is set for the
> secondary before allowing the master to continue on. The master unparks
> the secondary CPU's kthreads and the scheduler looks for a CPU to run
> on. It calls select_task_rq() and realises the suggested CPU is not in
> the cpus_allowed mask. It then ends up in select_fallback_rq(), and
> since the active bit isn't set we choose some other CPU to run on.
>
> This seems to have been introduced by 6acbfb96976f "sched: Fix hotplug
> vs. set_cpus_allowed_ptr()", which changed from setting active before
> online to setting active after online. However that was in turn fixing a
> bug where other code assumed an active CPU was also online, so we can't
> just revert that fix.
>
> The simplest fix is just to spin waiting for both active & online to be
> set. We already have a barrier prior to set_cpu_online() (which also
> sets active), to ensure all other setup is completed before online &
> active are set.
>
> Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
> Signed-off-by: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
> Signed-off-by: Anton Blanchard <anton@xxxxxxxxx>
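
As a rough illustration of the fix described above (not the patch itself), the
spin-wait can be sketched in userspace C. Everything here is a hypothetical
stand-in: `struct fake_cpu`, `secondary_start()` and `bring_up_and_wait()` are
made-up names, a pthread plays the secondary CPU, and two atomic flags play
the online and active bits. The sketch assumes the secondary sets online
first and active some time later, which is the window the master has to wait
out.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one CPU's online/active mask bits. */
struct fake_cpu {
    atomic_bool online;
    atomic_bool active;
};

/* Plays the secondary CPU: online is set first, active a little later. */
static void *secondary_start(void *arg)
{
    struct fake_cpu *cpu = arg;

    atomic_store(&cpu->online, true);
    sched_yield();                      /* widen the online-but-not-active window */
    atomic_store(&cpu->active, true);
    return NULL;
}

/*
 * Plays the master: start the secondary, then spin until BOTH bits are
 * set before continuing. Returns false only if the thread could not be
 * created.
 */
static bool bring_up_and_wait(struct fake_cpu *cpu)
{
    pthread_t thread;

    if (pthread_create(&thread, NULL, secondary_start, cpu) != 0)
        return false;

    /* Waiting on online alone would let the master race ahead. */
    while (!atomic_load(&cpu->online) || !atomic_load(&cpu->active))
        sched_yield();

    /* Past this point it would be safe to unpark per-CPU threads. */
    assert(atomic_load(&cpu->online) && atomic_load(&cpu->active));

    pthread_join(thread, NULL);
    return true;
}
```

A caller would zero-initialise a `struct fake_cpu` and call
`bring_up_and_wait()` on it; in the sketch, spinning on the online flag alone
would reopen the same kind of window the patch closes.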

By building a gcov enabled skiboot, which makes OPAL_START_CPU a whole
bunch slower (because gcov), I could really *really* reliably reproduce
this. With this patch, I cannot.

Tested-by: Stewart Smith <stewart@xxxxxxxxxxxxxxxxxx>
