Re: SCSI Kernel Problem - BAD

Steven S. Dick (ssd@nevets.oau.org)
Thu, 14 Mar 96 04:45 EST


Just to add my noise/data point to this discussion...
My system also crashes during large-block SCSI I/O.

My configuration:
external SCSI 2 cdrom, 2x spin
486sx25 (ya, it's slow)
Adaptec 1542CF
NEC SCSI hard drive mounted on /src
root partition on Maxtor IDE hard drive
kernel 1.3.71

If I mount the slackware linux cd, and run
cp -Rv /cdrom/slakware /src/slakware
it always crashes before it finishes the X disks.
Sometimes it crashes as early as the A disks; usually it crashes
somewhere in the N or T disks, when it's doing one file per disk...

Sometimes it will crash just by doing `zcat /cdrom/...| tar tvf - > /src/index`,
and doesn't even need the cp. I haven't played with it enough
to see if it also crashes when writing to the IDE drive.

The cp problem is fairly repeatable.
Causing it with tar is less repeatable.
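
One way to narrow down where the copy dies, and to try the still-untested
case of writing to the IDE drive, is to copy one disk-set directory at a
time and log each one before starting it. What follows is only a sketch:
the function name copy_sets and the log format are made up for
illustration, and it assumes each disk set is a subdirectory of the source.

```shell
#!/bin/sh
# copy_sets SRC DEST LOG  (hypothetical helper, not part of any package)
# Copies each subdirectory of SRC into DEST, one at a time. The name of
# each directory is appended to LOG and flushed with sync before the copy
# starts, so the last "start" line shows where a hard crash happened.
copy_sets() {
    src=$1
    dest=$2
    log=$3
    mkdir -p "$dest" || return 1
    for dir in "$src"/*/; do
        dir=${dir%/}                 # drop the trailing slash from the glob
        [ -d "$dir" ] || continue    # skip if the glob matched nothing
        name=$(basename "$dir")
        echo "start $name" >> "$log"
        sync                         # push the log line to disk first
        if ! cp -R "$dir" "$dest/$name"; then
            echo "FAIL $name" >> "$log"
            return 1
        fi
        echo "done $name" >> "$log"
    done
}
```

For example, `copy_sets /cdrom/slakware /src/slakware /root/copy.log`
repeats the failing copy; pointing DEST at a directory on the IDE drive
instead would show whether the SCSI disk has to be the target at all.
The log path here is an assumption; it should live on a disk that is not
involved in the crash.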

It *always* crashes while playing with the largest files. The system
kind of freezes just before the first 'timeout' error... Even console
switching is locked up until the timeout message is printed. What's
interesting is that I usually get between three and ten timeout errors /
cdrom drive hard resets followed by a couple of scsi bus resets. The CD
continues to indicate that it is actively reading data until after the
_LAST_ scsi bus reset, at which point it spins down the disk and does a
physical reset of the drive mechanism. (It sounds like half of the
power up noise--like the noise it makes just before ejecting the disk.)

If I have the swap space on the scsi drive enabled at the time,
just about every process in the whole system will get killed, and
I have to reset.

If I don't have the swap on the scsi activated, then the system tends to
mostly recover, except that the cp process and anything else that tries
to open the cdrom device afterwards get hung, and eventually the system
goes insane with kernel freelist corruption and crashes horribly anyway.

Something associated with that timeout code or the code triggered
by the timeout code corrupts kernel memory all over the place.

More details available upon request.
I even have some stack traces from freelist corruption--but I'm not
entirely sure that they'll be useful--too far from the actual problem...
and I'll have to decode the addresses by hand (not much of a problem).
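
Decoding by hand amounts to finding, for each address in the trace, the
nearest symbol at or below it in the kernel's symbol map. A minimal sketch
of that lookup follows. It assumes a map file with lines of the form
"address type symbol", sorted by address, with addresses written as
8-digit lowercase hex; the function name lookup_symbol is made up for
illustration.

```shell
#!/bin/sh
# lookup_symbol ADDRESS MAPFILE  (hypothetical helper)
# Prints "symbol+0xoffset" for the closest symbol at or below ADDRESS.
# Returns 1 if ADDRESS is below every symbol in the map.
lookup_symbol() {
    # Normalise the address to 8-digit lowercase hex, like the map file.
    target=$(printf '%08x' "$((0x${1#0x}))")
    match=$(awk -v t="$target" '
        # Prefixing "x" forces a string comparison; equal-width lowercase
        # hex strings sort in the same order as their numeric values.
        ("x" $1) <= ("x" t) { best = $1 " " $3 }
        END { if (best != "") print best }
    ' "$2")
    [ -n "$match" ] || return 1
    sym_addr=${match%% *}
    sym_name=${match#* }
    printf '%s+0x%x\n' "$sym_name" "$(( 0x$target - 0x$sym_addr ))"
}
```

The map has to come from the same build as the running kernel, otherwise
the symbol names it reports will be wrong.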

Steve
ssd@nevets.oau.org