Re: stop_machine lockup issue in 3.9.y.

From: Ben Greear
Date: Wed Jun 05 2013 - 16:58:46 EST


On 06/05/2013 12:31 PM, Ben Greear wrote:
This is no longer really about the module unlink, so changing
subject.

On 06/05/2013 12:11 PM, Ben Greear wrote:
On 06/05/2013 11:48 AM, Tejun Heo wrote:
Hello, Ben.

On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
One pattern I see repeating in at least most of the hangs is that all but one
of the CPU threads have IRQs disabled and are in state 2. But there will be one
thread in state 1 that still has IRQs enabled, and it is reported to be in
soft-lockup instead of hard-lockup. In 'sysrq l' it always shows some IRQ
processing, but typically that of the sysrq itself. I added a printk that would
always print if the thread notices that smdata->state != curstate, and the
soft-lockup thread (cpu 2 below) never shows that message.

It sounds like one of the CPUs gets live-locked by IRQs. I can't tell
why the situation is made worse by other CPUs being tied up. Do you
ever see CPUs being live-locked by IRQs during normal operation?

Hmm, I wonder if I found it. I have previously seen times where jiffies
appears not to increment. __do_softirq has a break-out based on a
jiffies timeout. Maybe that is failing to get us out of __do_softirq
in my lockup case because, for whatever reason, the system cannot update
jiffies?

I added this (probably whitespace damaged) hack and now I have not been
able to reproduce the problem.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 14d7758..621ea3b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
 	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
 	int cpu;
 	unsigned long old_flags = current->flags;
+	unsigned long loops = 0;

 	/*
 	 * Mask out PF_MEMALLOC s current task context is borrowed for the
@@ -241,6 +242,7 @@ restart:
 			unsigned int vec_nr = h - softirq_vec;
 			int prev_count = preempt_count();

+			loops++;
 			kstat_incr_softirqs_this_cpu(vec_nr);

 			trace_softirq_entry(vec_nr);
@@ -265,7 +267,7 @@ restart:

 	pending = local_softirq_pending();
 	if (pending) {
-		if (time_before(jiffies, end) && !need_resched())
+		if (time_before(jiffies, end) && !need_resched() && (loops < 500))
 			goto restart;

 		wakeup_softirqd();
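
For anyone who wants to see the effect of the extra condition without
booting a kernel, here is a small userspace model of the restart loop. It is
only a sketch under stated assumptions: softirqs are assumed to stay pending
forever, need_resched() is assumed to be false, exactly one handler runs per
pass, and MODEL_SOFTIRQ_TIME, MODEL_MAX_LOOPS and model_softirq_runs are
hypothetical names. The safety_limit parameter exists only so the unpatched
case terminates in the model.

```c
#define MODEL_SOFTIRQ_TIME 2UL   /* illustrative timeout, in ticks */
#define MODEL_MAX_LOOPS    500UL /* cap used in the hack above */

/* Returns the number of handler runs before the loop gives up and would
 * hand the remaining work to ksoftirqd.
 *   tick_step == 0  models a jiffies value that never moves
 *   use_cap         selects the patched condition (loops < 500)
 *   safety_limit    bounds the model itself, not part of the kernel logic */
static unsigned long model_softirq_runs(unsigned long tick_step, int use_cap,
                                        unsigned long safety_limit)
{
    unsigned long now = 0;                         /* stands in for jiffies */
    unsigned long end = now + MODEL_SOFTIRQ_TIME;
    unsigned long loops = 0;

    for (;;) {
        loops++;            /* one softirq handler ran */
        now += tick_step;   /* a timer tick arrives, if any */

        if (!((long)(now - end) < 0))
            break;          /* jiffies-based break-out: time_before() failed */
        if (use_cap && loops >= MODEL_MAX_LOOPS)
            break;          /* loop-count break-out from the hack */
        if (loops >= safety_limit)
            break;          /* only so the model terminates */
    }
    return loops;
}
```

With a ticking counter both variants leave after two runs. With a frozen
counter the unpatched variant runs until safety_limit, while the capped
variant stops at 500, independent of whether jiffies is being updated.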

Thanks,
Ben

--
Ben Greear <greearb@xxxxxxxxxxxxxxx>
Candela Technologies Inc http://www.candelatech.com
