Processor stuck in smp_call_function (arch/i386/kernel/smp.c)

From: Steffen Persvold (sp@scali.com)
Date: Tue Nov 26 2002 - 19:33:51 EST


Dear kernel experts,

On a couple of dual Xeon, E7500-based machines (SuperMicro motherboards) we
have been experiencing frequent lockups when running compute- and
I/O-intensive tasks. Eventually I patched the 2.4.20-rc2 kernel with kdb
v2.5 by Keith Owens to find out what was happening.

Now, when a system becomes unresponsive, I am able to enter kdb and get a
backtrace. It looks like this (trimmed to what is IMHO the useful info
only):

smp_call_function+0x83 (0xc01141e0, 0x0, 0x1, 0x1)
flush_tlb_all+0x14 ()
vmfree_area_pages+0x180 (0xf8c00000, 0x11000)
vfree+0x39 (0xf8a5e000)
release_segments+0x47 (0xf6435880)
exit_mmap+0x12 (0xf6435880)
mmput+0x5d (0xf6435880)
do_exit+0xd0 (0x200)

smp_call_function() is looping here :

        /* Wait for response */
        while (atomic_read(&data.started) != cpus)
                barrier();
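
To make the failure mode concrete, here is a minimal user-space sketch of
the same wait, assuming C11 atomics instead of the kernel's atomic_t. The
names call_data, wait_for_started, responder_checks_in and the max_spins
limit are hypothetical and are not in smp.c; the kernel loop above has no
limit at all, which is why it spins forever if the other cpu never enters
its handler.

```c
#include <stdatomic.h>

/* Hypothetical user-space stand-in for the data the caller waits on. */
struct call_data {
    atomic_int started;   /* number of cpus that have entered the handler */
};

/*
 * Spin until `cpus` responders have checked in, or give up after
 * max_spins iterations (the bound is an assumption added for this sketch).
 * Returns 0 if all responders started, -1 if the wait was abandoned.
 */
static int wait_for_started(struct call_data *data, int cpus, long max_spins)
{
    long spins = 0;

    while (atomic_load(&data->started) != cpus) {
        if (++spins >= max_spins)
            return -1;    /* responder never showed up */
    }
    return 0;
}

/* What a responding cpu would do on entry to its IPI handler. */
static void responder_checks_in(struct call_data *data)
{
    atomic_fetch_add(&data->started, 1);
}
```

With one expected responder that never calls responder_checks_in(), the
bounded wait returns -1; the unbounded kernel loop in the same situation
never returns, which matches what the backtrace shows on cpu 0.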

So it seems a process is about to exit and the TLB needs to be flushed.
However, when the active cpu (cpu 0) waits for the flush_tlb_all_ipi()
function to start on cpu 1, it loops forever. In addition, trying to
switch to cpu 1 in kdb (with the 'cpu 1' command) results in an
'Invalid cpu number' error; Keith told me this must be because the
other cpu hasn't responded to the kdb_ipi NMI.

It looks as if cpu 1 has died for some reason. Why?

Has anyone experienced the same behaviour ?

I would really appreciate any input on this.

PS.

I've attached the output of dmesg and lspci; I hope it is helpful. As you
can see, the system has a lot of IO-APICs. The reason is that all of the
E7500 MCH hub interfaces on this system are populated:

Hub Interface A : ICH3 (main APIC)
Hub Interface B, C, D : P64H2 (each with two PCI-X busses and APICs).

IOAPIC #2 is the APIC on the ICH3
IOAPIC #3 is the 1st APIC on Hub interface B
IOAPIC #4 is the 2nd APIC on Hub interface B
IOAPIC #5 is the 1st APIC on Hub interface C
IOAPIC #8 is the 2nd APIC on Hub interface C
IOAPIC #9 is the 1st APIC on Hub interface D
IOAPIC #10 is the 2nd APIC on Hub interface D

The PCI device where most of the IO is performed is located on the 2nd
APIC on Hub interface D.
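
For reference, the mapping above can be written down as a small lookup
table. This is only a sketch of the listing in this mail; the struct and
function names (ioapic_loc, ioapic_map, ioapic_lookup) are made up for
illustration and do not come from the kernel.

```c
#include <stddef.h>

/* One IO-APIC as listed above. index 0 means the single APIC on the ICH3;
 * 1 and 2 mean the 1st and 2nd APIC of a P64H2. */
struct ioapic_loc {
    int  id;     /* IOAPIC number as reported at boot */
    char hub;    /* hub interface letter: 'A'..'D' */
    int  index;  /* which APIC on that hub interface */
};

static const struct ioapic_loc ioapic_map[] = {
    {  2, 'A', 0 },
    {  3, 'B', 1 },
    {  4, 'B', 2 },
    {  5, 'C', 1 },
    {  8, 'C', 2 },
    {  9, 'D', 1 },
    { 10, 'D', 2 },
};

/* Return the entry for a given IOAPIC number, or NULL if it is not listed. */
static const struct ioapic_loc *ioapic_lookup(int id)
{
    size_t i;

    for (i = 0; i < sizeof(ioapic_map) / sizeof(ioapic_map[0]); i++) {
        if (ioapic_map[i].id == id)
            return &ioapic_map[i];
    }
    return NULL;
}
```

By this table, the 2nd APIC on Hub interface D, where most of the IO
happens, is IOAPIC #10.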

DS

Thanks,

-- 
  Steffen Persvold   |       Scali AS      
 mailto:sp@scali.com |  http://www.scali.com
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY





