2.4.24 SMP lockups

From: Simon Kirby
Date: Fri Jan 09 2004 - 16:07:18 EST


'lo all,

We've had about 6 cases of this now, across 4 separate boxes. Since
upgrading to 2.4.24, our SMP web server boxes (both Intel and AMD
hardware) are randomly blowing up. This may have happened on 2.4.23 as
well, but they weren't really running long enough to tell. 2.4.22 was
fine. GCC 3.3.3.

These boxes are all dual CPU, and the failure shows up suddenly, with no
warning. SysRq-P works, but only ever reports from one CPU, no matter how
many times I try. In normal operation, every machine distributes all
IRQs across both CPUs, and SysRq-P reports from both CPUs.

Mapping the EIP reported by SysRq-P to symbols shows that the responding
CPU is spinning on a spinlock (so far I have seen .text.lock.fcntl,
.text.lock.sched, .text.lock.locks, and .text.lock.inode), which I assume
is being held by the other (dead) CPU.
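
For anyone wanting to repeat the EIP-to-symbol mapping by hand, here is a
rough sketch of the lookup, assuming a System.map-style listing of
"address type name" lines. The addresses, the schedule_example symbol, and
the helper names load_system_map and resolve are made up for illustration;
they are not taken from the affected machines, and this is not necessarily
how the mapping above was done.

```python
import bisect

def load_system_map(lines):
    """Parse System.map-style lines ("address type name") into a sorted list."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def resolve(symbols, eip):
    """Return (name, offset) for the nearest symbol at or below eip, else None."""
    addrs = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addrs, eip) - 1
    if i < 0:
        return None  # eip is below every known symbol
    addr, name = symbols[i]
    return name, eip - addr

# Hypothetical sample data: the addresses are invented for this example.
SAMPLE_MAP = """\
c0105000 T schedule_example
c0230000 t .text.lock.sched
c0230400 t .text.lock.fcntl
""".splitlines()

symbols = load_system_map(SAMPLE_MAP)
```

Calling resolve(symbols, eip) with the EIP printed by SysRq-P gives the
nearest preceding symbol and the offset into it. An EIP that lands in one
of the .text.lock.* regions means the CPU is in the out-of-line spin loop
for a lock taken somewhere in that source file.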

Even on boxes with nmi_watchdog=1, nothing is reported from the NMI
watchdog.

Simon-