Help! Everything is crashing!

Bjarni R. Einarsson (bre@margmidlun.is)
Thu, 26 Mar 1998 10:09:15 +0000


Did that topic get your attention?

I think there is a new teardrop-like bug in the 2.0.x kernels. The
reason I think so follows..

I'm the sysadmin for an ISP in Iceland, where we base almost all of our work
on Linux. We're running the following production servers:

+ Mail server: Intel Pentium, 3com, AIC-7881Ur0 SCSI (1 disk).
+ Proxy server: Intel Pentium, 3com, AIC-7881Ur0 SCSI (4 disks).
+ 3 dialin boxen, Intel/Cyrix, 3com, 1 or 2 32-port Cyclades boxes each.
+ Web server: Cyrix or AMD(?), 3com.
+ Masq/backup box: Cyrix or AMD(?), 3com, NCR 53c810 SCSI (1 tape).

These boxes were, for a long time, all running the same kernel (2.0.32pre6
with 3000 FD patch). The backup box still IS running this kernel, with 105
days uptime, the web server 38 days and the mail server 20 days. Both the
mail server and the web server crashed last time they went down.

The rest haven't been so stable. Until last week the only real problems
were with the dialin boxes (which also masquerade most of my customers, so
they are highly visible on IRC). Then, after 34 days of cool running, the
proxy server started to die.

It doesn't die randomly.. it dies in the morning, when there are no cron
jobs running, and very few users connected. This has happened 3 days out of
the past four now..

The dialin servers also died at predictable times, late at night on
weekdays, or in the afternoons on weekends.

Freaky? It's mostly the timing that has me thinking this is an attack.
That, and the fact that these boxes used to be very stable, running the
exact same setups.

When I go to work to get the machines on their feet again, either the
consoles are locked tight (blank screen) or there are numerous identical
error messages across the screen. The proxy server dies complaining about
eth0 and memory (I'll write it down next time I see it), and the dialin
boxes would go "Aiee", saying something about scheduling and shutting down
the interrupt handler. I very rarely see an oops.. when I do it is one of
many, with the first long scrolled to oblivion. The logs tell me nothing.

I know this info is very vague.. but it's all I've got at the moment. I'll
take notes next time.

Now I've downgraded my dialin boxes to 2.0.29 with the teardrop fix, and the
proxy server to 2.0.29 with the 3000fd patch (I'm running Squid NOVM) and
the 2.0.30/31/32 patches without the networking changes that were on
ftp.uk.linux.org.

Since downgrading the dialin boxes (which I did 11 days ago), two of the
three have been fine (surviving one weekend), but the third locked up 4 days
ago. This is the only box I suspect may have flakey hardware (a slightly
iffy Cyclades adapter). All the other hardware in here has given me well
over 30 days of uptime.. sometimes much more.

One other thing.. most of my servers are based on RedHat 3.0.3, where the
only upgrades that have been done were because of security bugs. Can I
expect upgrading to RedHat 5.0 or 4.2 to make things more stable?

I would really like to find this bug, but I'm not much of a kernel hacker
and don't know where to start. Help?

Any advice is very welcome, but please keep in mind that this IS a
production environment and I can't really afford to experiment very much..

If things keep crashing like this, I'll be forced to consider another
platform for my servers (*BSD comes to mind). :-(

-- 
Bjarni R. Einarsson
 bre@margmidlun.is               [ THIS SPACE INTENTIONALLY LEFT BLANK ]
 http://www.mmedia.is/~bre
 Juggler@IRC                           DO NOT PULL ON YELLOW TIP!
 
