Re: [ANNOUNCE] 3.0.1-rt11

From: Thomas Gleixner
Date: Wed Sep 07 2011 - 12:00:59 EST


On Tue, 6 Sep 2011, Frank Rowand wrote:

> On 08/26/11 16:55, Paul E. McKenney wrote:
> > On Wed, Aug 24, 2011 at 04:58:49PM -0700, Frank Rowand wrote:
> >> On 08/13/11 03:53, Peter Zijlstra wrote:
> >>>
> >>> Whee, I can skip release announcements too!
> >>>
> >>> So no the subject ain't no mistake its not, 3.0.1-rt11 is there for the
> >>> grabs.
>
> < snip >
>
> >> I have a consistent (every boot) hang on boot. With a few
> >> hacks to get console output, I get the
> >>
> >> rcu_preempt_state detected stalls on CPUs/tasks
>
> < snip >
>
> >> This is an ARM NaviEngine (out of tree, so I also have applied
> >> a series of patches for platform support).
> >>
> >> CONFIG_PREEMPT_RT_FULL is set. Full config is attached.
>
> I have also replicated the problem on the ARM RealView (in tree) and
> without the RT patches.
>
> >
> > Hmmm... The last few that I have seen that looked like this were
> > due to my messing up rcutorture so that the RCU-boost testing kthreads
> > ran CPU-bound at real-time priority.
> >
> > Is it possible that something similar is happening on your system?
> >
> > Thanx, Paul
>
> The problem ended up being caused by the allowed cpus mask being set
> to all possible cpus for the ksoftirqd on the secondary processors.
> So the RCU softirq was never executing on cpu 2.
>
> I'll test the following patch on 3.1 tomorrow.
>
> -Frank Rowand
>
>
> Symptom: rcu stall
>
> The problem was that ksoftirqd was woken on the secondary processors before
> the secondary processors were online. This led to allowed cpus being set
> to all cpus.
>
>    wake_up_process()
>       try_to_wake_up()
>          select_task_rq()
>             if (... || !cpu_online(cpu))
>                select_fallback_rq(task_cpu(p), p)
>                   ...
>                   /* No more Mr. Nice Guy. */
>                   dest_cpu = cpuset_cpus_allowed_fallback(p)
>                      do_set_cpus_allowed(p, cpu_possible_mask)
>                         # Thus ksoftirqd can now run on any cpu...

This smells badly like the problem we've seen on x86 before. And
looking at the arm SMP boot code:

asmlinkage void __cpuinit secondary_start_kernel(void)
{
.....

	/*
	 * Give the platform a chance to do its own initialisation.
	 */
	platform_secondary_init(cpu);

	/*
	 * Enable local interrupts.
	 */
	notify_cpu_starting(cpu);
	local_irq_enable();

Here we enable interrupts, but the CPU is neither online nor active.

	local_fiq_enable();

	/*
	 * Setup the percpu timer for this CPU.
	 */
	percpu_timer_setup();

	calibrate_delay();

	smp_store_cpu_info(cpu);

	/*
	 * OK, now it's safe to let the boot CPU continue. Wait for
	 * the CPU migration code to notice that the CPU is online
	 * before we continue.
	 */
	set_cpu_online(cpu, true);
	while (!cpu_active(cpu))
		cpu_relax();

That's the same thing x86 does, except that here interrupts are
already enabled, so it does not help. And the softirq is only part of
the problem; the same can happen with worker threads and other cpu
bound nasties.

	/*
	 * OK, it's off to the idle thread for us
	 */
	cpu_idle();
}

So that wants to be ordered differently. Patch below.

Thanks,

tglx

Index: linux-2.6/arch/arm/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/arm/kernel/smp.c
+++ linux-2.6/arch/arm/kernel/smp.c
@@ -305,6 +305,16 @@ asmlinkage void __cpuinit secondary_star
 	 * Enable local interrupts.
 	 */
 	notify_cpu_starting(cpu);
+
+	/*
+	 * OK, now it's safe to let the boot CPU continue. Wait for
+	 * the CPU migration code to notice that the CPU is online
+	 * before we continue.
+	 */
+	set_cpu_online(cpu, true);
+	while (!cpu_active(cpu))
+		cpu_relax();
+
 	local_irq_enable();
 	local_fiq_enable();

@@ -318,15 +328,6 @@ asmlinkage void __cpuinit secondary_star
 	smp_store_cpu_info(cpu);
 
 	/*
-	 * OK, now it's safe to let the boot CPU continue. Wait for
-	 * the CPU migration code to notice that the CPU is online
-	 * before we continue.
-	 */
-	set_cpu_online(cpu, true);
-	while (!cpu_active(cpu))
-		cpu_relax();
-
-	/*
 	 * OK, it's off to the idle thread for us
 	 */
 	cpu_idle();
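
For anyone who wants to see why the ordering matters without booting a
board, here is a rough user-space model. It is only a sketch and not
kernel code: the toy_* names, the 4-CPU bitmask and the simplified
fallback logic are all made up for illustration, and the wait for
cpu_active() is left out. It models a per-cpu thread being woken
either before or after its CPU is marked online.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS  4
#define TOY_ALL_CPUS ((1u << TOY_NR_CPUS) - 1)

/* Hypothetical stand-in for a task bound to one CPU (e.g. ksoftirqd/N). */
struct toy_task {
	unsigned int cpus_allowed;	/* bitmask of CPUs the task may run on */
};

/* Hypothetical stand-in for the online mask; the boot CPU 0 is online. */
static unsigned int toy_online_mask = 1u;

/*
 * Simplified model of the fallback path: if none of the allowed CPUs
 * is online, the affinity is widened to all CPUs and stays that way.
 */
static int toy_select_rq(struct toy_task *p)
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (p->cpus_allowed & toy_online_mask & (1u << cpu))
			return cpu;

	/* "No more Mr. Nice Guy." */
	p->cpus_allowed = TOY_ALL_CPUS;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_online_mask & (1u << cpu))
			return cpu;

	return 0;
}

/*
 * Bring up a secondary CPU and wake its bound thread once. With
 * irqs_before_online set, the wakeup arrives before the CPU is marked
 * online (the old ordering); otherwise it arrives afterwards.
 * Returns the thread's resulting affinity mask.
 */
static unsigned int toy_bringup(int cpu, bool irqs_before_online)
{
	struct toy_task ksoftirqd;

	assert(cpu > 0 && cpu < TOY_NR_CPUS);

	ksoftirqd.cpus_allowed = 1u << cpu;
	toy_online_mask = 1u;		/* start with only the boot CPU */

	if (irqs_before_online) {
		(void)toy_select_rq(&ksoftirqd);	/* wakeup while offline */
		toy_online_mask |= 1u << cpu;
	} else {
		toy_online_mask |= 1u << cpu;
		(void)toy_select_rq(&ksoftirqd);	/* wakeup while online */
	}

	return ksoftirqd.cpus_allowed;
}
```

In this model toy_bringup(2, true) leaves the thread allowed on every
CPU, which corresponds to the widened ksoftirqd mask Frank reported,
while toy_bringup(2, false) keeps it bound to CPU 2.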
