Re: SMP kernel lockup, 2.2.14 and 2.2.15pre1

From: Robert Cohen (robert@apex.net.au)
Date: Thu Mar 23 2000 - 20:23:43 EST


This sounds, superficially at least, very similar to the freezes that I
am seeing on a squid box, at least as far as the freezing/sysrq
behaviour goes.

See my recent postings to linux-kernel for more details.
However, my freezes are on UP.

I suspect that mine is stable if I compile with gcc-2.7.2. At least I
haven't had a crash.
But since it's non-deterministic, it could be coincidence.

Anyway, I was able to get a full oops log with the IKD patch and a
serial console.
You can find the IKD patch on kernel.org in Andrea's directory.
It applies cleanly to 2.2.14 and, with some fiddling, can be applied to
2.2.15pre15.

Patrick J. LoPresti wrote:
>
> We are able to consistently wedge 2.2.14 and 2.2.15pre15 on SMP
> systems using only user-mode processes with modest resource
> requirements.
>
> Unfortunately, this lockup involves a commercial product (Perforce),
> so we may need to debug it ourselves. This message is a request for
> help.
>
> Here is the deal:
>
> We are converting an internal source code repository from CVS to
> Perforce. We have a conversion script which takes about 9 hours to
> run. All processes involved (script, client, and server) are running
> as a normal user, not as root.
>
> About 5-8 hours into the conversion, we usually (about 80% of the
> time) see the machine hang. It does not respond to pings nor to any
> activity on the keyboard (e.g., ctrl-alt-del). The magic sysrq key
> works, however; see below.
>
> We have now reproduced this on three different SMP systems, with
> different hardware configurations, each while running 2.2.14 compiled
> by egcs-1.1.2 and while running 2.2.15pre15 compiled by gcc-2.95.2.
> (I did not realize until recently that someone installed 2.95.2 as our
> default compiler; sorry for the inconsistency.) So this is pretty
> clearly not a hardware problem.
>
> We have magic sysrq support compiled into 2.2.15pre15. While the
> machine was locked, we pressed "alt-sysrq-p" repeatedly and wrote down
> the various EIP values. Mapping those by hand to kernel addresses
> (using "nm vmlinux"), we determined that we are spending some portion
> of our time in tcp_delack_timer, tcp_send_delayed_ack, add_timer, and
> timer_bh. (This appears to be a loop waiting for a socket to become
> unlocked.) Some of the EIP values had no obvious entry in the symbol
> table; I am guessing they are in dynamically loaded modules.
>
> Alt-sysrq-s and Alt-sysrq-u each print the message saying what they
> are trying to do, but no further output. We see no messages about
> specific disks being synced; no "OK" for the unmount; nothing. So the
> kernel does appear to be locked pretty solid.
>
> Finally, we are running the same experiment on a uniprocessor box and
> on 2.2.13 SMP, neither of which has deadlocked yet. We have only done
> one or two such trials so far, however. They are ongoing.
>
>
>
> -
> T
>
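
The hand-mapping of sysrq-p EIP values against "nm vmlinux" output
described above could probably be scripted. What follows is only a
rough sketch in Python, not anything either of us has actually run on
these kernels. The addresses in SAMPLE_NM are made up for illustration,
the function names parse_nm and resolve are hypothetical, and the
max_offset cutoff is an arbitrary guess meant to stop module addresses
from being blamed on the last symbol in the table.

```python
import bisect

# Hypothetical excerpt of `nm vmlinux` output. The addresses are invented
# for illustration; real ones come from your own kernel build.
SAMPLE_NM = """\
c0110a40 T add_timer
c0118f20 T timer_bh
c0165b10 T tcp_send_delayed_ack
c0166c30 T tcp_delack_timer
c0250000 D jiffies
         U some_undefined_symbol
"""


def parse_nm(text):
    """Return a sorted list of (address, name) for text-section symbols."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # undefined symbols have no address column
        addr, kind, name = parts
        if kind.lower() == "t":  # keep code symbols only
            symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, eip, max_offset=0x10000):
    """Map an EIP to (symbol, offset), or None if nothing plausible matches.

    max_offset is an assumed sanity limit: an EIP far past the nearest
    preceding symbol is more likely in a module than in that function.
    """
    addrs = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addrs, eip) - 1  # last symbol at or below eip
    if i < 0:
        return None
    addr, name = symbols[i]
    offset = eip - addr
    if offset > max_offset:
        return None
    return name, offset


if __name__ == "__main__":
    table = parse_nm(SAMPLE_NM)
    # Example EIP values, as they might be copied down from the console.
    for eip in (0xC0110A52, 0xC0166C30, 0xC8800000):
        hit = resolve(table, eip)
        if hit is None:
            print("%08x: no match (module or bad value?)" % eip)
        else:
            print("%08x: %s+0x%x" % (eip, hit[0], hit[1]))
```

To use it on a real system you would feed it the full text of
"nm vmlinux" for the running kernel instead of SAMPLE_NM. Addresses
inside loadable modules will not resolve this way, since they are not in
vmlinux's symbol table.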

--
Robert Cohen - Network Administrator      
Apex Internet     
robert@apex.net.au     http://www.apex.net.au
Ph (02) 6247 2000      Fax: (02) 6247 2711
