Re: Ok, I give up...

Jelle Foks (jelle@flying.demon.nl)
Sun, 19 Dec 1999 22:24:40 +0100 (AMT)


Hello,

Just some near random thoughts that might help you here...

Are you logging the syslog on remote machines (@hostname entries in the
syslogd.conf)? I've found in the past that that helps against the loss of
log file tails.

Are the boxes SMP, and are the disks maybe in (IDE) PIO mode sometimes? If
that is possible, then you should try the 14pre15, because Alan has
recently fixed a lockup bug for that case.

Did you use kernels with minimal hardware support compiled-in (Just
the minimum what you need, from the hardware that is present, and without
using kernel modules and kerneld)?

Did you try shuffling around with the hardware, different network cards,
different mainboard/CPU combinations, other/new DIMMs, different power
supplies (hooked up to different power groups, different/newer UPS-es)),
to see when the problem gets worse or less (might help find the problem).
Even though it used to work perfectly, the PC hardware could have become
flakey since you ran the 2.0.x kernels (CPU fans are not the only things
that can break down), and the power net supply and/or UPS (110V/220V) can
have become worse, which can also have an influence on system stability.

Also, are you running X11 on the servers (if so, I suggest you stop doing
that, the XFree86 binaries have direct access to all hardware, so a bug in
that binary can wreak havoc, and in addition VGA cards can lockup the PC
hardware in some cases)? Id even suggest running the PCs without video
cards (dont forget to compile a new kernel without console support, or you
will be left with a significant slowdown of your server).

Good luck finding the problem,

Jelle.

On Sun, 19 Dec 1999, David Edwards wrote:

> About 2 months ago, I moved all of my servers ( 15 of them) to the 2.2.x
> kernels. Some were clean installs of RH 6.0, some were upgrades to RH
> 5.2. But all of these servers ran 2.0.35 and 2.0.36 with 100+ days of
> uptime and were rock solid. But since moving to the 2.2 kernels on the
> same hardware, reliability and uptime sucks. Seems like I can rarely get a
> month of uptime with the 2.2 kernels, and I've tried everything from 2.2.5
> to 2.2.14pre13. The few oopses I've had have been traced back to buggy
> hardware that has since been replaced. But in most every case with the 2.2
> kernels, the servers (mainly serving web pages) run for a few days to a
> week and then lock up completely. Then it requires a power cycle to bring
> it back to life.
>
> So I'm stumped. When a system locks up I don't get any console messages or
> anything in syslog. Since all of these servers ran 2.0.x kernels and were
> very stable for months, I'm inclined to think the hardware is fine. So any
> clues as to how I might track this down? I'd hate to say it (flame jacket
> on) but my NT and freeBSD boxes are putting Linux to shame in terms of
> stability....
>
> David
>
> Please read the FAQ at http://www.tux.org/lkml/
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/