2.3.29 fails to recover from SCSI errors

Neil Brown (neilb@cse.unsw.edu.au)
Fri, 3 Dec 1999 11:58:43 +1100 (EST)


Hi,
I am having problems with 2.3.29 (and at least 2.3.28 as well) where
SCSI errors cause the machine to lock up.

After a bunch of errors like:

Dec 3 11:30:59 glass kernel: scsi : aborting command due to timeout : pid 188131, scsi0, channel 0, id 1, lun 0 Read (10) 00 00 37 3f 53 00 00 02 00
Dec 3 11:30:59 glass kernel: scsi : aborting command due to timeout : pid 188136, scsi0, channel 0, id 1, lun 0 Read (6) 0e 86 d3 10 00
Dec 3 11:30:59 glass kernel: scsi : aborting command due to timeout : pid 188134, scsi0, channel 0, id 1, lun 0 Write (6) 01 3f 51 02 00
Dec 3 11:31:00 glass kernel: SCSI host 0 abort (pid 188131) timed out - resetting
Dec 3 11:31:00 glass kernel: SCSI bus is being reset for host 0 channel 0.
Dec 3 11:31:00 glass kernel: (scsi0:0:1:0) Performing Domain validation.
Dec 3 11:31:00 glass kernel: SCSI host 0 abort (pid 188131) timed out - resetting
Dec 3 11:31:00 glass kernel: SCSI bus is being reset for host 0 channel 0.
Dec 3 11:31:00 glass kernel: scsi : aborting command due to timeout : pid 188236, scsi0, channel 0, id 3, lun 0 Read (6) 10 07 d7 02 00
Dec 3 11:31:01 glass kernel: scsi : aborting command due to timeout : pid 188244, scsi0, channel 0, id 0, lun 0 Read (6) 06 02 a1 02 00
Dec 3 11:31:02 glass kernel: SCSI host 0 channel 0 reset (pid 188131) timed out - trying harder
Dec 3 11:31:02 glass kernel: SCSI bus is being reset for host 0 channel 0.
Dec 3 11:31:03 glass kernel: SCSI host 0 reset (pid 188131) timed out again -
Dec 3 11:31:03 glass kernel: probably an unrecoverable SCSI bus or device hang.
Dec 3 11:31:04 glass kernel: (scsi0:0:0:0) Synchronous at 80.0 Mbyte/sec, offset 15.
Dec 3 11:31:04 glass kernel: (scsi0:0:1:0) Successfully completed Domain validation.
Dec 3 11:31:04 glass kernel: (scsi0:0:3:0) Synchronous at 80.0 Mbyte/sec, offset 15.
Dec 3 11:31:04 glass kernel: (scsi0:0:1:0) Synchronous at 80.0 Mbyte/sec, offset 15.
Dec 3 11:31:04 glass kernel: (scsi0:0:1:0) Performing Domain validation.
Dec 3 11:31:04 glass kernel: (scsi0:0:1:0) Successfully completed Domain validation.

All the processes that were accessing the SCSI discs are hung in a D
wait (WCHAN "wait_on_buffer" or "wait_on_page" or "down" or the like).
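
For anyone trying to reproduce this, the following is a minimal sketch
of one way to list such processes. It assumes a procps-style "ps" that
accepts "-eo pid,stat,wchan:32,comm" and a Python 3 interpreter; the
function names parse_ps and list_hung are only illustrative and are not
part of any existing tool.

```python
import subprocess


def parse_ps(output):
    """Return (pid, wchan, command) for each process in D state.

    Expects text in the layout printed by
    `ps -eo pid,stat,wchan:32,comm` (an assumed procps invocation).
    """
    hung = []
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        pid, stat, wchan, comm = fields
        # The first letter of STAT is the state; "D" means
        # uninterruptible sleep, usually waiting on I/O.
        if stat.startswith("D"):
            hung.append((int(pid), wchan, comm.strip()))
    return hung


def list_hung():
    """Run ps and return the processes currently in D state."""
    out = subprocess.run(
        ["ps", "-eo", "pid,stat,wchan:32,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)


if __name__ == "__main__":
    for pid, wchan, comm in list_hung():
        print(pid, wchan, comm)
```

Running it while the hang is in progress should show which wait
channel each stuck process is sleeping on, which is the same
information as the WCHAN column quoted above.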

The machine isn't completely hung, as I can still log in, and a reboot
sometimes succeeds after successfully unmounting the filesystems
(though sometimes it hangs while "unmounting filesystems").

It looks like a failed request is getting lost and never completed,
either successfully or otherwise.

If anyone has any suggestions - a fix maybe, or some pointers on where
to look or how to instrument my kernel to get more details - I would
appreciate it.

In case it is relevant, I have a dual Pentium II-350, an Adaptec 2940-U2W
host adapter, and three 18GB Seagate LVD drives.
The drives have ext2 filesystems and are being accessed by knfsd.

NeilBrown
