Re: Buslogic weirdness?

Leonard N. Zubkoff (lnz@dandelion.com)
Tue, 19 Nov 1996 23:27:02 -0800


Date: Tue, 19 Nov 1996 21:24:39 -0500
From: Anil Somayaji <soma@ai.mit.edu>

When I came in this evening, I found a bunch of messages referring to
command timeouts. The system seems to have recovered fine this time,
but a few weeks ago I had a situation where I got a bunch of write
errors that somehow propagated to the point of making the entire 4G
drive unusable. What was even weirder was that I got similar errors
when I tried to restore from tape; after doing a hard reformat,
though, things seemed to be fine until tonight. (I'm sorry I don't
have a record of the messages - they got destroyed along with the
rest.)

The one thing I could correlate was that I was trying to copy a lot of
files (several hundred, ~300 megs in total) from one partition to another
on the drive by using a tar pipe. I now have only two partitions on the
drive: data and swap.

Here's the system: Linux 2.0.24, Debian 1.1, Pentium Pro 200 MHz,
Buslogic BT-958, with the following attached:

Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: MICROP Model: 3243-19 1128RLAV Rev: RLAV
  Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 02 Lun: 00
  Vendor: TOSHIBA Model: CD-ROM XM-5401TA Rev: 3605
  Type: CD-ROM ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 03 Lun: 00
  Vendor: HP Model: C1533A Rev: 9503
  Type: Sequential-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 04 Lun: 00
  Vendor: iomega Model: jaz 1GB Rev: H.71
  Type: Direct-Access ANSI SCSI revision: 02

The Jaz drive is external, and is connected by a slightly weird cable
(had to go from a wide SCSI connector to a regular SCSI-2 connector),
but I don't think it is a cabling problem, primarily because my
initial failed restore occurred after the Jaz drive was detached. Of
course, I could be wrong.

Whatever the problem is, it is very intermittent. Maybe this is a
hardware/cabling problem; but then again, has anyone else seen
behavior like this?

From the error messages you enclosed, it appears that you had a SCSI bus hang
while 18 commands were outstanding to the Micropolis drive. The SCSI subsystem
and driver attempted to abort the commands, but the aborts also timed out, and
so sending a Bus Device Reset message was tried next. That too failed, so a
Hard Reset was issued to the host adapter, which forces a full SCSI Bus Reset.
The host adapter was successfully reinitialized, and I assume the system
then resumed normal operation and retried the timed-out commands.
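
The escalation described above can be sketched as a simple loop. This is
only an illustrative model in Python, not the actual code of the Linux SCSI
subsystem or the BusLogic driver; the names ESCALATION, recover, and attempt
are made up for this example.

```python
from typing import Callable, List, Tuple

# Recovery steps in the order described above, mildest first.
ESCALATION = ["abort", "bus_device_reset", "hard_reset"]


def recover(attempt: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Try each recovery step in order and stop at the first that succeeds.

    `attempt` is a hypothetical callback: given a step name, it returns
    True if that step cleared the hang. The result is the step that worked
    (or "failed" if none did) plus the list of steps that were tried.
    """
    tried: List[str] = []
    for step in ESCALATION:
        tried.append(step)
        if attempt(step):
            return step, tried
    return "failed", tried


# Model of this incident: the aborts time out, the Bus Device Reset fails,
# and only the Hard Reset of the host adapter succeeds.
outcome, steps = recover(lambda step: step == "hard_reset")
print(outcome, steps)
```

Each step is more disruptive than the one before it, which is why the
milder ones are tried first.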

SCSI bus hangs of this sort are usually due to (1) cabling and/or termination
problems, (2) bugs in the disk drive firmware, or (3) bugs in the host adapter
firmware. Also, older Micropolis 3243 units had some general reliability
problems. If the previous errors you had actually required reformatting the
drive, that's not likely to be a host adapter or operating system issue. I'd
be very concerned that the drive may not be reliable and further problems are
likely.

I don't know what version of firmware is in your BT-958, but a new version was
officially released recently. I've updated my Linux web page with copies of
the latest BT-948/958/958D firmware and BIOS and the Flash utility. I suggest
you update your BT-958 to the new firmware.

Finally, if the drive's firmware doesn't implement tagged queuing correctly,
all sorts of problems are possible under heavy I/O loads. You can disable
tagged queuing entirely by booting with the command line "BusLogic=TQ:Disable",
or allow a lower queue depth (number of outstanding commands) by booting with
"BusLogic=0,N" where N is the desired queue depth. I suggest trying 7 or 15.
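
As a small illustration, the two option strings above can be generated like
this. The helper name buslogic_boot_option is hypothetical, and it only
formats the two forms quoted in this message; how the string is passed to
the kernel at boot time is outside the sketch.

```python
from typing import Optional


def buslogic_boot_option(queue_depth: Optional[int] = None) -> str:
    """Return one of the two boot strings quoted in this message.

    queue_depth=None -> disable tagged queuing entirely
    queue_depth=N    -> keep tagged queuing but limit the depth to N
    """
    if queue_depth is None:
        return "BusLogic=TQ:Disable"
    if queue_depth < 1:
        raise ValueError("queue depth must be a positive integer")
    return f"BusLogic=0,{queue_depth}"


# Print the three variants suggested above.
for depth in (None, 7, 15):
    print(buslogic_boot_option(depth))
```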

Leonard