Re: NON IRQ DEADLOCK in 2.0.31

Rob Hagopian (hagopiar@vuser.vu.union.edu)
Fri, 14 Nov 1997 13:48:47 -0500 (EST)


Again... I believe that the software watchdog is designed to simulate a
real process, under the assumption that if a normal process can't start
fast enough, the machine must be deadlocked and should be rebooted. It
doesn't do well when swapping is so excessive that it can't start. I have
_no_ problems with this, in theory.

The problem is that users don't understand what's going on and try their
damaging actions over and over. Yes, this is where process limits come in.
I should. I haven't. It's irrelevant. I haven't had to reboot the machine
in over a month now (although now that we have snow the power failures
will no doubt start :-( ).

The web server OTOH died with a DEADLOCK and didn't reboot. Yes, I'm
leaving myself open to vulnerabilities by not having a hardware watchdog,
but I can live with that. 99.44% of the time the kernel knows when bad
things are happening. Most of those are panics. I myself haven't had a bad
freeze in a long time (good hardware is a good thing). However, this
DEADLOCK is a problem, esp on SMP machines.

Back to the original problem, the machine was grinding itself into the
ground this morning. There were a LOT of validating probes on the screen
(and scrolling). There were 2 device errors, one for each IDE hard drive
(4.3G Caviar). There was one device not ready error 03:03. The machine
would respond to pings and tried to open connections, but that's about it.
Again, it was in the middle of another tape backup from another machine
via NFS. The situation seemed very much like out of memory, but I couldn't
prove it and didn't have the time to deal with it, so I pulled the plug.
Nothing in the logs.

Is there something about NFS that I should know about? Our other machine
(PPro 180 clocked to 200) doesn't seem to have a problem with the backups,
but it's not serving 100+ web pages at the same time. A conflict with
cookies maybe?
-Rob H.

On Fri, 14 Nov 1997, Mike Jagdis wrote:

> On Fri, 14 Nov 1997, Christophe Dupre wrote:
>
> > On Fri, 14 Nov 1997, Rob Hagopian wrote:
> >
> > > I've actually had some problems with the software watchdog:
> > > A user has a 150MB email, tries to open it in pine, pine goes hogwild
> > > with memory (swap), the computer slows to a crawl, the watchdog process
> > > doesn't get spawned fast enough, the computer resets.
> >
> > How can a user get a 150 MB EMail ? Gross..... Anyway, in this case even
> > with a hardware watchdog, this machine would reset, since the hardware
> > watchdog counts on a process (softdog) to reset the counter on a regular
> > basis. Regarding the software watchdog, I'm sure you could go and modify
> > the timer and raise it to a few minutes...... Just go and modify
> > /usr/src/linux/drivers/char/softdog.c ...
>
> Perhaps I'm being dense but shouldn't a software watchdog
> be running with a SCHED_FIFO scheduling class and mlock
> its important pages? You really don't want it to reboot
> your machine just because it's a bit stressed doing work!
>
> Mike
>
> --
> .----------------------------------------------------------------------.
> | Mike Jagdis | Internet: mailto:mike@roan.co.uk |
> | Roan Technology Ltd. | |
> | 54A Peach Street, Wokingham | Telephone: +44 118 989 0403 |
> | RG40 1XG, ENGLAND | Fax: +44 118 989 1195 |
> `----------------------------------------------------------------------'
>
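
For what it's worth, Mike's SCHED_FIFO and mlock suggestion quoted above
could look roughly like the sketch below. This is only an illustration,
not the code of any existing watchdog daemon: the function names
(harden_process, pet_watchdog, run_watchdog) are made up, and it assumes
the watchdog device resets its timer on any write, the process runs as
root, and the device path and keepalive interval are supplied by the
caller.

```c
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Pin every page of this process in RAM and move it to the SCHED_FIFO
 * real-time class, so heavy swapping or CPU load cannot starve it.
 * Both calls normally require root. Returns 0 on success, -1 on failure. */
int harden_process(void)
{
    struct sched_param sp;

    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return -1;
    }
    /* Highest FIFO priority is an assumption; a lower one may be safer. */
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}

/* One keepalive: write a single byte to the watchdog descriptor.
 * Returns 0 if the byte was written, -1 otherwise. */
int pet_watchdog(int fd)
{
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/* Hypothetical entry point: harden the process first, then open the
 * device (opening it arms the timer) and pet it every interval_sec
 * seconds. Returns -1 as soon as anything fails; never returns 0. */
int run_watchdog(const char *dev, unsigned interval_sec)
{
    int fd;

    if (harden_process() != 0)
        return -1;
    fd = open(dev, O_WRONLY);
    if (fd < 0) {
        perror(dev);
        return -1;
    }
    while (pet_watchdog(fd) == 0)
        sleep(interval_sec);
    close(fd);
    return -1;
}
```

A main() would simply call something like run_watchdog("/dev/watchdog", 10)
with an interval well below the driver's timeout. Locking and rescheduling
happen before the device is opened, so a failure there leaves the timer
unarmed instead of rebooting the box.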