Re: 2.4.24 SMP lockups

From: Marcelo Tosatti
Date: Sat Jan 10 2004 - 15:07:43 EST



---------- Forwarded message ----------
Date: Sat, 10 Jan 2004 17:32:55 -0200 (BRST)
From: Marcelo Tosatti <marcelo.tosatti@xxxxxxxxxxxx>
To: Simon Kirby <sim@xxxxxxxxxxxxx>, Andrew Morton <akpm@xxxxxxxx>
Subject: Re: 2.4.24 SMP lockups



On Fri, 9 Jan 2004, Simon Kirby wrote:

> 'lo all,

Hi Simon,

> We've had about 6 cases of this now, across 4 separate boxes. Since
> upgrading to 2.4.24, our SMP web server boxes (both Intel and AMD
> hardware) are randomly blowing up. This may have happened on 2.4.23 as
> well, but they weren't really running long enough to tell. 2.4.22 was
> fine. GCC 3.3.3.
>
> These boxes are all dual CPU, and the failure case shows up suddenly with
> no warning. Sysreq-P works, but only reports from one CPU no matter how
> many times I try. In normal operation, every machine distributes all
> IRQs across both CPUs, and Sysreq-P reports from both CPUs.
>
> Mapping the EIP reported by Sysreq-P to symbols shows that the responding
> CPU is spinning on a spinlock (so far I have seen .text.lock.fcntl,
> .text.lock.sched, .text.lock.locks, and .text.lock.inode), which I assume
> is being held by the other (dead) CPU.

This sounds like a deadlock. I wonder why the NMI watchdog is not
triggering.

> Even on boxes with nmi_watchdog=1, nothing is reported from the NMI
> watchdog.

Can you share all available SysRQ-P output for the locked CPU ? SysRQ-T if
possible, too.

Can you please describe the hardware in more detail. Is there any common
hardware used in these boxes?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/