Random crashes

Behan Webster (behanw@verisim.com)
Fri, 28 Nov 1997 14:51:17 -0500


We've been having trouble with our computers: they tend to reboot or
lock up under certain circumstances.

Most of our workstations will lock at the end of a big compile. If we
compile a really large library, the compile will finish, and then,
seemingly as it writes the last bit out to the hard drive, the whole
system locks.

Other times, the SCSI light will come on and Linux will continue to run,
but as soon as a process tries to access the hard drive it locks. Pretty
soon every process is waiting on the disk.

We also have times when the screen just freezes, and little dots or
lines will appear at the top or bottom of the screen.

Also, one of our servers locks every time we run amanda on it
(writing out to a DAT tape). In this case we actually get error
messages on the console:

scsi0 channel 0 : resetting for second half of retries.
SCSI bus is being reset for host 0 channel 0.
eata_reset called pid:37553 target: 0 lun: 0 reason 0
eata_reset: slot 13 in reset, pid 37555.
eata_reset: board reset done, enabling interrupts.
eata_reset: interrupts disabled again.
eata_reset: slot 13 locked, DID_RESET, pid 3755 done.
eata_reset: exit, wakeup.
eata_dma: int_handler, reseted command pid 37555 returned
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 27000002
scsidisk I/O error: dev 08:03, sector 274568
Kernel panic: scsi_free:Trying to free unused memory
In swapper task - not syncing
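
For anyone reading the log above, here is a rough sketch of how the
"return code" and "dev" fields might be split apart. It assumes the
usual Linux SCSI result layout (driver byte in the top 8 bits, then
host byte, message byte, and status byte) and the usual sd numbering of
16 minors per disk. The function names are hypothetical and this is
not taken from the eata driver itself.

```python
def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four bytes.

    Assumed layout: driver << 24 | host << 16 | msg << 8 | status.
    """
    return {
        "status": result & 0xFF,          # raw SCSI status byte
        "msg": (result >> 8) & 0xFF,      # SCSI message byte
        "host": (result >> 16) & 0xFF,    # host adapter (DID_*) code
        "driver": (result >> 24) & 0xFF,  # mid-level driver/suggest code
    }


def decode_sd_dev(dev):
    """Turn a 'major:minor' string (hex, as printed by the kernel)
    into (major, disk name, partition), assuming 16 minors per sd disk."""
    major_s, minor_s = dev.split(":")
    major, minor = int(major_s, 16), int(minor_s, 16)
    disk_index, partition = divmod(minor, 16)
    return major, "sd" + chr(ord("a") + disk_index), partition


if __name__ == "__main__":
    print(decode_scsi_result(0x27000002))
    print(decode_sd_dev("08:03"))
```

Under those assumptions the status byte is 0x02 (CHECK CONDITION), the
host byte is 0, the driver byte is 0x27, and dev 08:03 is the third
partition on the first SCSI disk.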

All of these symptoms can only be cured by a hard reset.

Has anyone seen this kind of problem before? Anyone have any ideas what
to try? We've been banging away on this problem for 2-3 months. We've
tried different Linux kernels, different BIOS settings, reseating all
the boards, new SIMMs, even a new (but same type of) motherboard. All
to no avail.

The only commonality that we can see is that the affected systems all
seem to be PCI based (both the SCSI controller and the video card are
PCI).

Is there any possibility that there is a race condition in the Linux
kernel PCI code? I'm grasping at straws here. Sorry.

Actually, we have some more of these computers at another site. These
computers are running NT and also exhibit random crashes. I understand
that installing Service Pack 3 has largely fixed their problems though.
(i.e. NT crashes less often now) 8)

The Linux machines are Tyan dual-Pentium II motherboards (some with 1,
some with 2 processors) with DPT 2044UW SCSI controllers and PCI video
cards.

We did find that we crashed fewer times when we moved from the 2.0.x
kernels to the 2.1.x kernels. Currently we are running the 2.1.64
kernels on all our machines. This kernel seems to have made certain of
our computers more stable and others less stable. 8)

The NT boxen use the same motherboard, but an Adaptec AHA2940UW SCSI
controller instead.

At this point we suspect a PCI hardware problem on the motherboard, but
that's very hard to prove. I was just wondering whether it might be a
software problem in the Linux kernel instead.

Any help is much appreciated.
Thanks,

Behan

-- 
Behan Webster     mailto:behanw@verisim.com
+1-613-224-7547   http://www.verisim.com/