Re: SMP 2.1.90-pre3 SCSI kernel panic

Doug Ledford (dledford@dialnet.net)
Mon, 16 Mar 1998 05:21:14 -0600


sistema@readysoft.es wrote:
>
> Machine: SMP kernel pre-90-3 + dual PII + aic-5.0.7 patch +
> integrated AIC-7895 Dual Ultra SCSI + Tyan Motherboard
>
> 2.0.33 UP kernel works flawlessly. 2.0.33 SMP locks hard randomly, even
> with a BusLogic Flashpoint card instead of the Adaptec one.
>
> Since 2.1.89, including pre90-[123], SMP kernels keep hanging a little
> later after getting these messages:

Unless someone else knows of a change in 2.1.x that could cause this, I'm
inclined to attribute this to a change in the way 2.1.x allocates space on
the filesystem. That is, 2.1.x is trying to write to disk blocks that 2.0.x
is ignoring. Here's the decode of your sense data:

> Mar 16 10:39:57 rs120 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002

The aic7xxx driver got a check condition host status, performed a request
sense operation, and is alerting the mid level code to the presence of the
sense data (it was most likely accompanied by an underrun error; otherwise we
wouldn't have flagged the error code and would instead have let the mid
level code look at the sense data and decide if there was even an error).
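
For anyone who wants to pick the return code apart by hand, here is a
minimal sketch. It assumes the usual Linux layout of the 32-bit result word
(driver byte on top, then host byte, message byte, and SCSI status byte at
the bottom). The function name split_result is made up for illustration, and
the reading of bit 0x08 in the driver byte as "sense data present" is from
memory of the kernel headers, so check it against your tree.

```python
def split_result(result):
    """Split a 32-bit SCSI result word into its four bytes.

    Assumed layout: driver byte in bits 24-31, host byte in bits 16-23,
    message byte in bits 8-15, SCSI status byte in bits 0-7.
    """
    return {
        "driver": (result >> 24) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "status": result & 0xFF,
    }


fields = split_result(0x28000002)
for name, value in fields.items():
    print(f"{name:6s} = 0x{value:02x}")

# Status byte 0x02 is CHECK CONDITION in SCSI-2.
# Assumption: bit 0x08 of the driver byte means sense data is attached.
print("check condition:", fields["status"] == 0x02)
print("sense attached :", bool(fields["driver"] & 0x08))
```

The host byte comes out as zero here, so the interesting parts are the
status byte and the driver byte.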

> Mar 16 10:39:57 rs120 kernel: Deferred error sd08:03: sns = f1 4
> Mar 16 10:39:57 rs120 kernel: ASC= 3 ASCQ= 0
> Mar 16 10:39:57 rs120 kernel: Raw sense data:0xf1 0x00 0x04 0x00 0x7e 0x55 0x3e
> 0x0a 0x00 0x00 0x00 0x00 0x03 0x00 0x11 0x80

Broken down, this is a deferred error with valid error information (0xf1 ==
0x71 (deferred error) | 0x80 (valid bit)).
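
The same split of the first sense byte, as a small sketch. The helper name
decode_byte0 is hypothetical; the bit positions (valid bit in bit 7, response
code in the low seven bits, 0x70 for a current error and 0x71 for a deferred
one) are as I read them in SCSI-2.

```python
def decode_byte0(b):
    """Return (valid, response_code, description) for sense byte 0."""
    valid = bool(b & 0x80)   # bit 7: information field is valid
    code = b & 0x7F          # bits 0-6: response code
    kinds = {0x70: "current error", 0x71: "deferred error"}
    return valid, code, kinds.get(code, "unknown")


valid, code, kind = decode_byte0(0xF1)
print(f"valid={valid} code=0x{code:02x} ({kind})")
```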

Sense key of 0x04, quoting from the SCSI-II spec:

4h HARDWARE ERROR. Indicates that the target detected a non-
recoverable hardware failure (for example, controller failure,
device failure, parity error, etc.) while performing the command
or during a self test.

ASC=0x03, ASCQ=0x00 is found in the table to be:
03h 00h DTL W SO PERIPHERAL DEVICE WRITE FAULT

Sounds like a few bad sectors to me.
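
The whole decode above can be reproduced from the raw bytes with a short
sketch like the one below. It assumes the SCSI-2 fixed sense format (sense
key in the low nibble of byte 2, information field in bytes 3-6, ASC in byte
12, ASCQ in byte 13). The function name decode_fixed_sense is made up. For a
direct-access device with the valid bit set, the information field normally
carries the logical block address tied to the error, which would fit the bad
sector theory, but treat that reading as an assumption until you compare it
against the failing block numbers.

```python
# Raw sense data copied from the log message above.
RAW = bytes([0xf1, 0x00, 0x04, 0x00, 0x7e, 0x55, 0x3e, 0x0a,
             0x00, 0x00, 0x00, 0x00, 0x03, 0x00, 0x11, 0x80])


def decode_fixed_sense(raw):
    """Pick the main fields out of fixed-format sense data."""
    if len(raw) < 14:
        raise ValueError("need at least 14 bytes of fixed-format sense data")
    return {
        "valid": bool(raw[0] & 0x80),
        "response_code": raw[0] & 0x7F,
        "sense_key": raw[2] & 0x0F,
        "information": int.from_bytes(raw[3:7], "big"),
        "asc": raw[12],
        "ascq": raw[13],
    }


sense = decode_fixed_sense(RAW)
print(f"sense key   = 0x{sense['sense_key']:x}")
print(f"ASC/ASCQ    = 0x{sense['asc']:02x}/0x{sense['ascq']:02x}")
print(f"information = 0x{sense['information']:08x}")
```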

> Last message is a kernel panic. I even get messages complaining about
> insufficient disk space, but there's free space.

The insufficient disk space is probably the result of the ext2fs not being
able to properly read/write some inode block. The kernel panic would have
to be posted before I could comment on it.

> Any hints?
> I can turn on scsi debugging and try to catch that bug with some help.

Best solution to this problem is to get the scsiinfo package, use it to make
sure the AWRE and ARRE bits are turned on in the read/write error recovery
mode page on the SCSI drive, then back everything on the drive up, low level
format the drive, and re-install. If the AWRE and ARRE bits weren't on
before, then they should help in the future as the drive should
automatically remap bad sectors out on the fly with those bits set. An
alarmingly large number of SCSI drives these days ship with these bits turned
off.
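
If you end up with a raw dump of the read/write error recovery page (page
code 01h) rather than a decoded listing, the two bits can be checked with a
sketch like this. It assumes the SCSI-2 page layout, where AWRE is bit 7 and
ARRE is bit 6 of byte 2 of the page. The function name recovery_bits and both
sample pages are invented for illustration; they are not output from any real
drive.

```python
AWRE = 0x80  # automatic write reallocation enabled (byte 2, bit 7)
ARRE = 0x40  # automatic read reallocation enabled (byte 2, bit 6)


def recovery_bits(page):
    """Report AWRE/ARRE from a raw read/write error recovery mode page."""
    if len(page) < 3:
        raise ValueError("page too short")
    if page[0] & 0x3F != 0x01:
        raise ValueError("not the read/write error recovery page")
    flags = page[2]
    return {"AWRE": bool(flags & AWRE), "ARRE": bool(flags & ARRE)}


# Hypothetical sample pages: page code 0x01, page length 0x0a.
page_off = bytes([0x01, 0x0a, 0x00] + [0x00] * 9)
page_on = bytes([0x01, 0x0a, 0xc0] + [0x00] * 9)

print("bits off:", recovery_bits(page_off))
print("bits on :", recovery_bits(page_on))
```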

-- 

Doug Ledford <dledford@dialnet.net> Opinions expressed are my own, but they should be everybody's.
