More mysterious crashes

Radovan Brako (radovan@thphys.irb.hr)
Tue, 17 Feb 1998 00:22:23 +0100 (MET)


Synopsis: We have been seeing "mysterious" crashes of Linux boxes
on a flat local Ethernet, during working hours only. We think they
are related to bursts of activity from W95 boxes, and occur only on
machines with NE2000 clone NICs. This may or may not be related
to other "mysterious crashes" reported here recently. I suspect
that bursts of UDP broadcasts to port 138 from W95 machines are
causing it. Solutions seem to include changing to other net cards
(SMC), or bridging. Any opinions?

Detailed write-up: The Ethernet is a C-class size subnet, both
10Base2 and 10Base5, with several repeaters, not bridged except
for some peripheral segments, see below. Sudden lock-ups started
appearing recently on three Linux boxes. Strangest of all, they
occurred on very different machines, with processors ranging from
i486 to Cyrix and kernels from 1.2.13 to 2.0.29 (later upgraded to
2.0.33, when we thought the cause might be teardrop attacks).
One of them was far apart from the other two, over a couple of
repeaters and a 10Base5 segment. The machines also had very different
levels of activity at the time of the lock-ups: one was almost
completely idle, another was a dedicated nameserver, and the third
an FTP server, the last two with moderate net activity. The machines
would block suddenly and without any trace in the logs or on the
console (which would sometimes remain frozen, but more often went
blank), at times simultaneously, and at other times one would freeze
and the other would stay up. After examining different possibilities,
I noticed that all three had NE2000 clone cards. Tcpdump logs showed
no sign of deliberate attacks, so I looked for broadcast activity.
Indeed, by doing the statistics on logs collected on machines which
survived and on other UNIX machines, I found out that the lock-ups
occurred at the very second when there were bursts of several tens
of UDP broadcasts to port 138 (note: not 137), in all but one case.
(I analysed maybe 10 cases.) The bursts occur approximately once per
minute, and lock-ups did not always occur on the bursts with the largest
number of packets. I can't say whether intense collisions were
associated with those bursts; they may have been, since the local net
has several hundred meters of coax cable in total, and some BNC branches
are probably out of spec.
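
For anyone who wants to repeat the per-second statistics described
above, here is a minimal sketch. It is not the procedure actually used
for this report. It assumes text output from "tcpdump -n" with one
packet per line in roughly the form shown in the comment; the function
names, the broadcast address and the burst threshold are all
hypothetical and would have to be adapted to the local net.

```python
import re
from collections import Counter

# Assumed "tcpdump -n" text line (the exact format varies by version), e.g.
#   09:15:02.123456 10.0.0.7.138 > 10.0.0.255.138: udp 201
# Newer versions add "IP " after the timestamp; both forms are accepted.
LINE_RE = re.compile(r"^(\d\d:\d\d:\d\d)\.\d+ (?:IP )?(\S+) > (\S+?):")


def count_per_second(lines, broadcast_addr, port=138):
    """Count UDP packets sent to broadcast_addr:port in each second."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or "udp" not in line.lower():
            continue  # not a packet line, or not UDP
        second, _src, dst = match.groups()
        # The destination is printed as "address.port"; split at the last dot.
        addr, _, dst_port = dst.rpartition(".")
        if addr == broadcast_addr and dst_port == str(port):
            counts[second] += 1
    return counts


def bursts(counts, threshold):
    """Return the seconds (HH:MM:SS) with at least threshold packets."""
    return sorted(s for s, n in counts.items() if n >= threshold)


def lockups_in_bursts(lockup_seconds, burst_seconds):
    """Return the lock-up times that fall exactly on a burst second."""
    burst_set = set(burst_seconds)
    return [t for t in lockup_seconds if t in burst_set]
```

Feeding the log from a surviving machine through count_per_second(),
picking a threshold of a few tens of packets in bursts(), and comparing
the result with the noted lock-up times gives the kind of match
described above. The clocks of the machines involved have to agree to
within a second for the comparison to mean anything.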

We did not investigate much further, since we found solutions:
The nameserver stopped crashing when we put it on the other side of
a bridge, and another machine seems to be doing well after we replaced
its Ethernet card with an SMC. (The bridge result is interesting,
since UDP broadcasts should still come through, with different
timing of course, but collisions should not.) I would still like to know
whether this is a known problem, and what the cause is. It could be
hardware (in one case a lock-up seemed to have occurred while the
machine was just starting to boot, but I'm not sure if a network
card can do this at all), the driver, or other kernel networking code.
It seems unlikely that other faulty hardware is responsible (the
hardware is different on different machines, and crashes occur only
when a large number of Windows PCs are active).

Any ideas?

R. Brako
