SCSI TIMEOUTS - EUREKA!

Leonard N. Zubkoff (lnz@dandelion.com)
Fri, 22 Mar 1996 12:02:59 -0800


It's been increasingly well documented that there are interrupt latency issues
with kernels from 1.3.73 onward. Since then, there have also been many
reports of SCSI timeout problems on hardware that had apparently been working
fine before. What has not been at all clear is why interrupt latency should
affect commands that have timeouts of several seconds.

I'm still not sure why interrupt latency caused this timeout problem, but I
believe I've now found the source: a very old and subtle bug in the SCSI
timeout processing. Whenever a command completes, or its timeout is otherwise
to be canceled, the call "update_timeout(Command, 0)" is made. Unfortunately,
when used > 0 (see the patch below), this call signals a timeout instead of
canceling one: the command's timeout is set to 0 + used, the loop that adjusts
all pending timeouts then subtracts used again, and the resulting 0 is marked
as expired (-1). The timeout won't actually be processed unless the SCSI
timer is active and a timer interrupt happens to occur, and is allowed, at
just the wrong time (i.e. before the timeout is reset again). But if one does
occur at the wrong time, a command that has just completed successfully may be
treated as though it timed out.
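
To make the failure concrete, here is a minimal C sketch of the pre-patch
arithmetic for a single command. It is a model, not the kernel code: struct
cmd and old_update() are hypothetical names, and reading -1 as the "expired"
marker is inferred from the context lines of the patch below.

```c
#include <assert.h>

/* Hypothetical stand-in for Scsi_Cmnd; only the timeout field is modeled. */
struct cmd {
    int timeout;
};

/*
 * Model of the pre-patch arithmetic for one command.
 * "used" is the time that has passed since the timeouts were last updated.
 */
void old_update(struct cmd *c, int timeout, int used)
{
    assert(used >= 0);

    c->timeout = timeout + used;    /* set step: 0 + used when canceling */

    if (c->timeout > 0) {           /* adjust step, applied to pending commands */
        c->timeout -= used;         /* a canceled command is back at 0 here */
        if (c->timeout <= 0)
            c->timeout = -1;        /* ... and is therefore marked as expired */
    }
}
```

In this model, old_update(&c, 0, 0) leaves c.timeout at 0, which is the
intended cancel, while old_update(&c, 0, 3) leaves it at -1.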

The following patch is untested (since I've not been able to reproduce this
particular lossage myself), but it is simple enough that I'm fairly certain
of it. What I can't be sure of is that it actually addresses the timeout
problems people are seeing, since it's always possible this isn't the only
bug. Please try this out and report back publicly on whether it helps.

--- linux/drivers/scsi/scsi.c- Tue Mar 19 00:00:49 1996
+++ linux/drivers/scsi/scsi.c Fri Mar 22 11:38:11 1996
@@ -2153,7 +2153,7 @@

     if(SCset){
         oldto = SCset->timeout - used;
-        SCset->timeout = timeout + used;
+        SCset->timeout = timeout;
     }

     least = 0xffffffff;
@@ -2161,7 +2161,8 @@
     for(host = scsi_hostlist; host; host = host->next)
         for(SCpnt = host->host_queue; SCpnt; SCpnt = SCpnt->next)
             if (SCpnt->timeout > 0) {
-                SCpnt->timeout -= used;
+                if (SCpnt != SCset)
+                    SCpnt->timeout -= used;
                 if(SCpnt->timeout <= 0) SCpnt->timeout = -1;
                 if(SCpnt->timeout > 0 && SCpnt->timeout < least)
                     least = SCpnt->timeout;
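
For anyone who wants to check the patched logic without building a kernel,
here is a rough stand-alone C sketch of it. The assumptions are mine: struct
sketch_cmd and patched_update() are hypothetical names, a flat array stands in
for the per-host command queues, and returning "least" is only a convenience
of the sketch, not a claim about what the real function returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Scsi_Cmnd; only the timeout field is modeled. */
struct sketch_cmd {
    int timeout;
};

/*
 * Model of the patched logic. "set" is the command whose timeout is being
 * set or canceled (may be NULL); "used" is the time that has passed since
 * the timeouts were last updated. Returns the smallest remaining timeout,
 * or 0xffffffff if nothing is pending.
 */
unsigned int patched_update(struct sketch_cmd *cmds, size_t n,
                            struct sketch_cmd *set, int timeout, int used)
{
    unsigned int least = 0xffffffff;
    size_t i;

    assert(used >= 0);

    if (set)
        set->timeout = timeout;         /* patched: no longer timeout + used */

    for (i = 0; i < n; i++) {
        struct sketch_cmd *c = &cmds[i];

        if (c->timeout > 0) {
            if (c != set)               /* patched: skip the command just set */
                c->timeout -= used;
            if (c->timeout <= 0)
                c->timeout = -1;        /* marked as expired */
            if (c->timeout > 0 && (unsigned int)c->timeout < least)
                least = (unsigned int)c->timeout;
        }
    }
    return least;
}
```

With three commands at 500, 10 and 2, canceling the first with used == 3
leaves them at 0, 7 and -1: the canceled command stays canceled, the second
is adjusted normally, and only the third (which really has run out of time)
is marked as expired.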

The one good thing about this exercise is that it's flushed out a number of
problems in the abort/reset handling code. I have partially completed fixes or
improvements for the abort/reset code which I will be reporting on soon, but I
wanted to get the above patch to people immediately since it will hopefully
solve the bulk of the immediate problems.

I hope someone else manages to determine for certain whether there are
interrupt latency issues or not, and figure out why.

Leonard