Re: Linux can't stay up for more than an hour?

Ben 'The Con Man' Kahn (xkahn@cybersites.com)
Thu, 6 May 1999 01:28:45 -0400 (EDT)


On Wed, 5 May 1999, Ted Rolle wrote:

> If you switch to BSD, I'd like to know which flavor (OpenBSD, FreeBSD,
> ...) and your reason for selecting the target.
>
> There has been no replies to this post so far...

We would probably choose FreeBSD. Why? Oh. Um... I'm not sure.
Our sys admin has a lot of experience with FreeBSD and enjoys using it.
Personally, I would choose OpenBSD for security, but...

Anyway, this is getting off topic. Plenty of people have replied
to me. I hope no one minds if I post some of the replies. Please note
that we've been having this problem for about two months. By moving to a
server pool system, we were able to reduce the frequency of the crashes.
(The front end server never crashed... The backend Apache servers would
crash once a week or so.) We recently moved AWAY from the server pool and
back to a single server because we were moving the hardware to other
sites.

Anyway, the replies:

----------------------------------------------------
From: Mike Galbraith <mikeg@weiden.de>

Hi,

(this suggestion sucks for a production environment, but if noone...)

You could try the IKD patch and enable SMP-IOAPIC NMI Software Watchdog.
In the event of a lockup, the kernel should produce a nice oops showing
where the thing is hung. If you also enable the kernel tracer, you'll
receive a more detailed stack trace, but it may become long enough that
serial-line logging may be needed to capture the entire trace. (logging
over a serial port saves you copying the oops from the screen anyway:)

The patch is located on e-mind.com in /pub/andrea/ikd.

-Mike

P.S. don't _even_ try to use an egcs snapshot with ikd.. fastcall
handling changes make it go KABOOOOOM!!! :)
----------------------------------------------------

It's a production environment, and we need this up all the time.
But that's great advice. As a matter of fact, the SMP systems crashed FAR
more often than the single hosts. (But when we moved all the traffic to
the one single CPU host it crashed under high load.)

----------------------------------------------------
From: Keith Owens <kaos@ocs.com.au>

Odds are you are getting some diagnostics, just not seeing them. The
first thing I would do is run with a serial console (see
Documentation/serial-console.txt), send the output to a second machine
and capture the output there.
----------------------------------------------------

Yup. More diags would be good. Now we just need a terminal
server... Well, it'll ahve to wait for a few demos to pass... *sigh*

And Alan Cox thought it was our VERY heavy use of NFS. We think
he's right... There was a theory that it was SMP with heavy NFS access
that was the problem. I think SMP exposes problems with knfs...

I'd imagine that running a tcpdump process of a seperate host
which monitors traffic between the NFS server and the client. Then we
just have to see what the traffic is at the moment of the crash and...
Well, anyway, it'll be a little while until we can do that.

-Ben

------------------------------------ |\ _,,,--,,_ ,) ----------
Benjamin Kahn /,`.-'`' -, ;-;;'
(212) 924 - 2220 |,4- ) )-,_ ) /\
ben@cybersites.com --------------- '---''(_/--' (_/-' ---------------
Drawing on my fine command of language, I said nothing.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/