Re: SCSI Kernel Problem - BAD

Eric Youngdale (eric@aib.com)
Fri, 15 Mar 1996 10:24:19 -0500


>If I don't have the swap on the scsi activated, then the system tends to
>mostly recover except that the cp process and anything else that tries
>to open the cdrom device afterwards gets hung, and eventually the system
>goes insane with kernel freelist corruption and crashes horribly anyway.

I saw one myself just the other day. Someone was copying large
files off of a cdrom, and putting them onto disk. Something went wrong (all
messages were lost), and there was massive disk corruption. The /lib
directory was nuked. The passwd file was found instead in inetd.conf.
/bin/sh was a C program. Stuff like that. Good thing I keep a bootable
partition on an IDE disk, and it was also lucky that the system was more or
less just an image copy of the Red Hat 2.0 live cdrom (i.e. I could just
copy /lib back, and anything else that looked like it might have been
nuked).

Anyways, I really have no idea whatsoever what the problem is.
At the moment, I don't even have a lot of time to personally investigate.
My immediate inclination would be to blame something other than the scsi
code since I really haven't changed very much at all in the 1.3 series, and
it is only in the past week or two that I have become aware of the possibility of
some pervasive problem (sure there have been problems, but if it is the NCR
driver, I tend not to pay much attention).

In my case, I have a 1542, and there the abort/reset code
doesn't do much except send a few special commands to the 1542. Hmm, perhaps
the thing goes slightly insane if there is a data transaction in progress at
the same time we attempt a reset. The 1542 has no tagged queueing, so
tagged queueing isn't the explanation.

One thing for people to try - disable the reset() code for your
host adapter and see if the problem goes away. Someone else (Simon?)
reported that the problem went away when this was done, and I want to see
whether this is a universal truth or not. If this doesn't help, then
disable the abort() code and see whether this does the trick. If need be, I
can come up with a patch to make these kernel configuration options.
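
As a rough illustration of what such configuration options might look
like, here is a minimal standalone sketch. It is not the actual patch and
not real kernel code: the CFG_SCSI_DISABLE_* macros, struct cfg_host, and
the cfg_* functions are all hypothetical names invented for this example.
The idea is only that a disabled handler declines to touch the adapter,
so the two experiments above can be run one at a time.

```c
#include <assert.h>

/* Hypothetical compile-time switches (not real kernel options).
 * Set one to 1 to stub out the corresponding handler. */
#ifndef CFG_SCSI_DISABLE_RESET
#define CFG_SCSI_DISABLE_RESET 1   /* first experiment: no resets */
#endif
#ifndef CFG_SCSI_DISABLE_ABORT
#define CFG_SCSI_DISABLE_ABORT 0   /* second experiment: set to 1 */
#endif

enum cfg_result { CFG_NOT_ATTEMPTED, CFG_ISSUED };

/* Stand-in for a host adapter; it only counts what was sent to it. */
struct cfg_host {
    int resets_sent;
    int aborts_sent;
};

static enum cfg_result cfg_reset(struct cfg_host *h)
{
    assert(h != 0);
#if CFG_SCSI_DISABLE_RESET
    return CFG_NOT_ATTEMPTED;      /* reset code compiled out */
#else
    h->resets_sent++;              /* pretend to reset the adapter */
    return CFG_ISSUED;
#endif
}

static enum cfg_result cfg_abort(struct cfg_host *h)
{
    assert(h != 0);
#if CFG_SCSI_DISABLE_ABORT
    return CFG_NOT_ATTEMPTED;      /* abort code compiled out */
#else
    h->aborts_sent++;              /* pretend to abort the command */
    return CFG_ISSUED;
#endif
}
```

With the defaults shown, cfg_reset() leaves the adapter alone while
cfg_abort() still acts; flipping the two macros gives the second test.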

WRT the comments about tagged queueing, and the indeterminacy of
the order of operations, this is indeed worrying for a number of reasons,
not the least of which is that if you are modifying lots of files, you could
conceivably have updates to the inode table consistently being deferred
because of large file writes. My gut feeling is that if we encounter
a timeout and tagged queueing is enabled, then instead of direct action,
we probably want to send one of those marker thingies to the device
to force it to synchronize - then we reset the timeout period to give
the command a chance to complete.
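
The timeout idea above could be sketched roughly as follows. This is a
standalone illustration, not kernel code: struct tq_cmd, tq_on_timeout(),
and TQ_TIMEOUT_TICKS are hypothetical names, and the marker is treated as
an opaque request that the caller would send to the device. Falling back
to abort/reset on a second timeout is an assumption of this sketch; the
text above does not say what should happen then.

```c
#include <assert.h>

/* Arbitrary illustrative timeout period, in ticks. */
#define TQ_TIMEOUT_TICKS 500u

enum tq_action {
    TQ_SEND_MARKER,       /* ask the device to synchronize, then wait again */
    TQ_ABORT_OR_RESET     /* take direct action */
};

/* Hypothetical per-command state. */
struct tq_cmd {
    int tagged_queueing;      /* nonzero if tagged queueing is enabled */
    int marker_sent;          /* nonzero once a marker went out for this command */
    unsigned timeout_ticks;   /* ticks left before the next timeout fires */
};

/* Decide what to do when a command times out. */
static enum tq_action tq_on_timeout(struct tq_cmd *c)
{
    assert(c != 0);
    if (c->tagged_queueing && !c->marker_sent) {
        /* First timeout with tagged queueing: the command may simply
         * have been reordered behind others, so request a marker and
         * restart the timeout to give it a chance to complete. */
        c->marker_sent = 1;
        c->timeout_ticks = TQ_TIMEOUT_TICKS;
        return TQ_SEND_MARKER;
    }
    /* No tagged queueing, or the marker did not help (assumed policy). */
    return TQ_ABORT_OR_RESET;
}
```

A command on a device without tagged queueing goes straight to
abort/reset, which matches the current behaviour described earlier.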

-Eric

-- 
"The woods are lovely, dark and deep.  But I have promises to keep,
And lines to code before I sleep, And lines to code before I sleep."