[(patch?)] W95 can handle SCSI errors - Linux can't

Andries.Brouwer@cwi.nl
Tue, 18 Aug 1998 00:20:22 +0200 (MET DST)


Executive summary:
- Reading a bad CD-ROM on a SCSI drive kills 2.0.33 entirely.
- 2.1.115 survives, but can no longer access the CD-ROM drive.
- Win95 mentions `I/O error' and happily reads the next CD-ROM.
Suggestions and perhaps patches below.

A few days ago I set up a SuSE 5.1 system on an old 486/80
with Adaptec 1542 SCSI card. Ah, such a lot of bad memories!
Reading a bad CDROM (or getting a SCSI error in any other way)
would always crash the system in a loop of
SCSI bus is being reset ...
Sent BUS DEVICE RESET ...
Sending DID_RESET ...
aha1542_intr_handle: Unexpected interrupt ...
etc.
I tried the 2.0.33 kernel that came with SuSE 5.1, and it
died in precisely this way. An unclean reboot is the only way out.

Then I tried Windows 95, which was already installed on that machine,
and it said: Cannot read file, and was otherwise unimpressed. Hmm.
We can learn something. But have we learnt already?

I tried 2.1.115, and it survived, more or less, but rather inelegantly.

One bug is shown by the kernel message
negative putlock from ...
upon use of a good CDROM after the bad one.
The reason is that scsi_allocate_device() expects to be called
while io_request_lock is held, but sr_do_ioctl() calls it without
first claiming this lock.
Maybe a patch like

--- /g2/linux/linux-2.1.115/linux/drivers/scsi/sr_ioctl.c Tue Aug 4 20:50:51 1998
+++ ./sr_ioctl.c Mon Aug 17 21:22:46 1998
@@ -56,7 +56,10 @@
unsigned long flags;

SDev = scsi_CDs[target].device;
- SCpnt = scsi_allocate_device(NULL, scsi_CDs[target].device, 1);
+
+ spin_lock_irqsave(&io_request_lock, flags);
+ SCpnt = scsi_allocate_device(NULL, SDev, 1);
+ spin_unlock_irqrestore(&io_request_lock, flags);

retry:
if( !scsi_block_when_processing_errors(SDev) )

would be an improvement. At least it avoids the negative putlock.
But now in scsi_allocate_device() the code

if (wait && SCwait && SCwait->request.rq_status != RQ_INACTIVE) {
spin_unlock(&io_request_lock); /* FIXME!!!! */
sleep_on(&device->device_wait);
spin_lock_irq(&io_request_lock); /* FIXME!!!! */

waits forever, since the scsi_error handling left rq_status as
RQ_SCSI_BUSY, but (most likely) nothing will ever happen anymore.
Is this a bug in scsi_error.c?

So, what I am pointing out is first of all that the source code
is logically incorrect. Probably we need a file with explicit
documentation of all assumptions about the various states the
system can be in, and the values the variables can have.
I would like to be able to check mechanically that assumptions
are verified.

But in the second place, apart from the fact that the code is
logically buggy, the code is technically stupid.
(At least up to and including 2.0.33 we have: Bus reset? Worse than panic.
And also in 2.1.115 the scsi_error code loves to do bus resets.
Win95 does no such thing. If my scanner has problems, and my
disks are working fine, there is no reason I can see to do a
hard bus reset followed by a hard host reset, like the present
code does.

One might for example consider code like
(untested, not even compiled):
----------------------------------------------------------------
rtn = scsi_try_bus_device_reset(SCloop, RESET_TIMEOUT);

if( rtn == SUCCESS ) {
rtn = scsi_test_unit_ready(SCloop);
if( rtn == SUCCESS && scsi_unit_is_ready(SCloop) ) {
rtn = scsi_eh_retry_command(SCloop);
if( rtn == SUCCESS ) {
SCloop->host->host_failed--;
scsi_eh_finish_command(&SCdone,SCloop);
}
}
/*
* The bus device reset succeeded, but the unit is not ready.
* Better not start resetting the bus and kill the system.
* Certainly it must be a mistake to do a hard bus reset
* and a hard host reset while disk I/O is fine.
* Maybe this is a CDROM drive with a bad CD -- aeb.
*/
if( host->host_failed ) {
SCloop->host->host_failed--;
scsi_eh_finish_command(&SCdone,SCloop);
ourrtn = FALSE;
goto leave;
}
}
}
----------------------------------------------------------------
(don't take this particular code too seriously - I just want to
show that at this point: bus device reset failed, we must not
blindly continue with a bus reset).

Finally, after doing all that it can think of, the scsi_error
code takes the device off-line. But in cases of removable
devices we would like it to come back on line when it is
exchanged. One bad CD-ROM does not mean the drive is bad.

[After typing all this nonsense I tried the above code;
it was partially successful in that it survived the first
bad read, but died on the third or fourth with
panic("Unable to find empty mailbox for aha1542.\n");
. So, some more cleanup is required in the error_handler.]

Andries

P.S. Is detailed information on the Adaptec 1542 available today?
For example, it would be nice to have an abort routine that
really worked.

P.S. 2: Quite often a semaphore is allocated on the stack,
and a pointer to it is left in some structure after the
stack frame has become invalid. I cannot see places where
this might cause errors, but it is a bad practice,
and might cause mysterious errors.
E.g., one should add
SCpnt->request.sem = NULL;
in sr.c.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.altern.org/andrebalsa/doc/lkml-faq.html