Re: 2.4.24 SMP lockups

From: Andrew Morton
Date: Sat Jan 10 2004 - 17:42:00 EST


Marcelo Tosatti <marcelo.tosatti@xxxxxxxxxxxx> wrote:
>
>
>
> On Fri, 9 Jan 2004, Simon Kirby wrote:
>
> > 'lo all,
>
> Hi Simon,
>
> > We've had about 6 cases of this now, across 4 separate boxes. Since
> > upgrading to 2.4.24, our SMP web server boxes (both Intel and AMD
> > hardware) are randomly blowing up. This may have happened on 2.4.23 as
> > well, but they weren't really running long enough to tell. 2.4.22 was
> > fine. GCC 3.3.3.
> >
> > These boxes are all dual CPU, and the failure case shows up suddenly with
> > no warning. Sysreq-P works, but only reports from one CPU no matter how
> > many times I try. In normal operation, every machine distributes all
> > IRQs across both CPUs, and Sysreq-P reports from both CPUs.
> >
> > Mapping the EIP reported by Sysreq-P to symbols shows that the responding
> > CPU is spinning on a spinlock (so far I have seen .text.lock.fcntl,
> > .text.lock.sched, .text.lock.locks, and .text.lock.inode), which I assume
> > is being held by the other (dead) CPU.
>
> This sounds like a deadlock. I wonder why the NMI watchdog is not
> triggering.

Presumably it's spinning on the lock with interrupts enabled. Make that
the `NMI' counters in /proc/interrupts are incrementing for all CPUs.


> > Even on boxes with nmi_watchdog=1, nothing is reported from the NMI
> > watchdog.
>
> Can you share all available SysRQ-P output for the locked CPU ? SysRQ-T if
> possible, too.

sysrq-T would be best.

We don't have an each-CPU backtrace facility - it could be handy. There's
one in the low-latency patch for some reason.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/