SCSI Kernel Problem - BAD

Simon Shapiro (Shimon@i-connect.net)
Wed, 06 Mar 1996 00:39:28 -0600 (CST)


Hi,

I am posting to these two groups as the solution is somewhere in between :-)
This problem has been going on for a long time. I sort of learned to live
with it as no one seemed to be able/interested in helping me solve it.

1. Stage: 1.3.71, 64MB RAM, eata_dma and/or BusLogic, etc.
2. Action: cd /some_big_filesystem;
   find . -print | cpio -dmpv /another_big_partition
   You can replace this with tar | tar, etc.
3. Result: SCSI bus reset in a loop due to a timeout on one
   disk or another (typically the same disk).
4. Premature Conclusion: Bad disk
5. Action: Replace disk == no difference.
   Run rs on ``bad disk'' == no errors.
   dd if=/dev/bad_disk of=/dev/null bs=64k == no errors.
6. Action: find /bad_fs -print | cpio -o -C 65536 -O /dev/rmt0
7. Result: No Errors.
8. Action: Reduce blocking size == disk is still ``bad''
9. Desperate Action: Boot 1.3.35
10. Result: NO FAILURES!!!
11. Conclusion: 1.3.71 is broken.
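
For anyone who wants to try this, here is a minimal sketch of the copy
workload from step 2, wrapped in a shell function. The function name
copy_tree and its two arguments are only illustrative; the paths are
placeholders for two large file systems on SCSI disks, and the
destination must be an absolute path because the function changes
directory before copying. It falls back to the tar | tar variant if
cpio is not installed.

```shell
#!/bin/sh
# Sketch of the copy workload described in step 2.
# copy_tree SRC DST: copy everything under SRC into DST.
# DST must be an absolute path, since we cd into SRC first.
copy_tree() {
    ct_src=$1
    ct_dst=$2
    if command -v cpio >/dev/null 2>&1; then
        # -p pass-through, -d create directories, -m keep mtimes, -v verbose
        (cd "$ct_src" && find . -print | cpio -dmpv "$ct_dst")
    else
        # Equivalent tar | tar pipeline
        (cd "$ct_src" && tar cf - .) | (cd "$ct_dst" && tar xpf -)
    fi
}
```

Usage would be something like: copy_tree /some_big_filesystem /another_big_partition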

This also happens with 1.3.68, 1.3.64, and a few others.

I lied a bit. I know how to crash 1.3.35 the same way:

a. cpio to tape with blocking of 1MB.
b. do the cpio -dmp from one RAID-[01] DPT partition to another on a P5-90.

These are NOT bugs in the eata_dma driver, nor the BusLogic driver (unless
they are both bad exactly the same way - hard to swallow).
The bus reset comes from a layer above the HBA. Different HBAs react
differently, but the result is the same:

FAST SCSI I/O ON LINUX IS IMPOSSIBLE WITHOUT CRASHES.
LARGE BLOCK I/O ON LINUX IS IMPOSSIBLE WITHOUT CRASHES

We have had this problem for a long time. I have posted it several times. I can
repeat and reproduce it any time.

The problem cannot be reproduced by random seeking, dd or any other trivial
(but useless) method. It can only be reproduced by doing the type of fast
copy I am describing above. A clue can probably be found in the fact that
backup to tape never crashes, rs (which does random reads) never crashes,
dd to /dev/null never crashes. It always crashes on the WRITE side,
seemingly to the same drive.

The error is always an infinite loop of
``SCSI: resetting host scsi[01] due to target n''

It is always as a result of a ``timeout''. There is no way to kill it,
sync never returns, umount never returns, df never returns. Therefore
shutdown never completes.

These symptoms are consistent for ANY SCSI error: the endless loop, the death
of sync, etc. This holds even when the hardware has a real problem (disks, mainly).

SCSI tape failures typically just leave the process hung and abort.
At times, the process will keep a disk file open and refuse to die,
but the I/O subsystem is still alive and not a death trap.

I think it would be nice if we could fix it somehow. I do not have the time
to see my family, but will try and help as much as I can. I just do not know
the SCSI code nearly well enough.

Sincerely Yours,
(Sent on 03/06/96, 00:01:23)
Simon Shapiro i-Connect.Net, a Division of iConnect Corp.
Shimon@i-Connect.Net 13455 SW Allen Blvd., Suite 140 Beaverton OR 97008