Re: SCSI Kernel Problem - BAD

Leonard N. Zubkoff (lnz@dandelion.com)
Mon, 18 Mar 1996 22:16:47 -0800


From: "Eric Youngdale" <eric@aib.com>
Date: Mon, 18 Mar 1996 19:19:29 -0500

I have something I would like people to try if they have been
experiencing data corruption problems with SCSI. The patch is pretty
simple: it just disables the resets that are sent because of a timeout.
Someone else tried something like this and found that it helped.

As I recall, it helped avoid system hangs, but there wasn't any data
corruption involved.

Note that there may be error conditions which will no longer be
recoverable with this patch applied, but I am not sure how often this comes up.
Right now I would like to know if this improves system reliability.

It comes up much more often than you might think. Since I implemented
resetting in the BusLogic driver, I've had many reports where the timeout-based
reset code allowed the system to recover from error conditions that formerly
led to system hangs. It was Simon Shapiro who reported some of the worst
problems recently, and he now has some alpha test changes to my BusLogic driver
and to scsi.[ch]. I think I've handled all the race conditions and other reset
anomalies I described recently with the exception of the one I described as a
"paired reset". I'm hopeful he will be able to reproduce this problem and
gather enough debugging information for me to understand what's really
happening. Despite trying all the tests he's recommended, I've been unable to
cause the same sort of lossage he sees. I've been generating lots of timeouts
and resets, but I've seen the paired reset death only a very few times, and
never with enough information to determine what's happening.
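
To make the trade-off under discussion concrete, here is a minimal user-space
sketch in C. It is not the patch, and it is not code from scsi.[ch] or the
BusLogic driver; every name in it (cmd_model, simulate, the outcome values) is
hypothetical and chosen only for illustration. It assumes a simplified model in
which a command either completes after some number of ticks or never completes
because the device is wedged, and in which a reset always clears the condition.

```c
#include <stdbool.h>

/* Hypothetical outcomes for one command in this toy model. */
enum outcome {
    DONE_IN_TIME,   /* completed before the timeout expired        */
    DONE_LATE,      /* timed out, no reset sent, completed anyway  */
    RESET_ISSUED,   /* timed out and a reset was sent              */
    HUNG            /* timed out, no reset sent, never completes   */
};

/* Hypothetical description of one command; not a kernel structure. */
struct cmd_model {
    unsigned completion_ticks; /* when the command would finish     */
    unsigned timeout_ticks;    /* how long we wait before giving up */
    bool     wedged;           /* true: never finishes on its own   */
};

/*
 * Decide what happens to a command. reset_on_timeout == false models
 * the behaviour of the proposed patch (timeouts no longer send resets).
 */
enum outcome simulate(const struct cmd_model *c, bool reset_on_timeout)
{
    bool timed_out = c->wedged || c->completion_ticks > c->timeout_ticks;

    if (!timed_out)
        return DONE_IN_TIME;
    if (reset_on_timeout)
        return RESET_ISSUED;          /* recovers, but disturbs the bus */
    return c->wedged ? HUNG : DONE_LATE;
}
```

In this model a command that is merely slow (for example, delayed by high
interrupt latency) gets an unnecessary reset when resets are enabled, while a
command on a wedged device hangs forever when they are disabled. That is the
tension between the two positions above, nothing more.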

There have been other reports of problems with recent 1.3.x kernels and I
suspect there's a common thread here that's not yet understood. If interrupt
latencies have actually gone through the roof as someone hypothesized, that
could account for some of the timeout problems.

Leonard