Re: SCSI Kernel Problem - BAD

Laszlo Vecsey (master@internexus.net)
Wed, 6 Mar 1996 08:57:37 -0500 (EST)

Just thought I'd mention that I've been having the same problem over the
months with various Linux kernels with both an Adaptec 1742 EISA and most
recently an Adaptec 2940 PCI. The 2940 is stable using the Redhat 1.2.8
bootdisk though. With recent kernels (1.3.70) I get the same timeouts and
errors that you mention below.

On Wed, 6 Mar 1996, Simon Shapiro wrote:

> Hi,
>
> I am posting to these two groups as the solution is somewhere in between :-)
> This problem has been going on for a long time. I sort of learned to live
> with it as no one seemed to be able/interested in helping me solve it.
>
> 1. Stage: 1.3.71, 64MB RAM, eata_dma and/or BusLogic, etc.
> 2. Action: cd /some_big_filesystem;
>            find . -print | cpio -dmpv /another_big_partition
>            You can replace this with tar | tar, etc.
> 3. Result: SCSI bus reset in a loop due to timeout on one
>            disk or another (typically the same disk).
> 4. Premature Conclusion: Bad disk
> 5. Action: Replace disk == no difference.
>            Run rs on ``bad disk'' == no errors.
>            dd if=/dev/bad_disk of=/dev/null bs=64k == no errors.
> 6. Action: find /bad_fs -print | cpio -C 65536 -O /dev/rmt0
> 7. Result: No Errors.
> 8. Action: Reduce blocking size == disk is still ``bad''
> 9. Desperate Action: Boot 1.3.35
> 10. Result: NO FAILURES!!!
> 11. Conclusion: 1.3.71 is broken.
>
> This also happens with 1.3.68, 1.3.64, and a few others.
>
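The copy in step 2 can be sketched as a small shell function. This is only a sketch under assumptions: the name bulk_copy and its two directory arguments are hypothetical, and it uses the tar | tar variant that the report says can replace the cpio -dmpv pipeline.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): copy a whole tree
# from SRC to DST through a pipe, producing the same kind of sustained
# read-and-write load as "find . -print | cpio -dmpv".
bulk_copy() {
    src=$1
    dst=$2
    [ -d "$src" ] || { echo "bulk_copy: no such directory: $src" >&2; return 1; }
    mkdir -p "$dst" || return 1
    # The first tar streams the source tree to stdout; the second tar
    # unpacks that stream inside the target directory.
    (cd "$src" && tar cf - .) | (cd "$dst" && tar xf -)
}

# Example (placeholder paths taken from the report):
#   bulk_copy /some_big_filesystem /another_big_partition
```

Run against two large filesystems on the same SCSI bus, this would exercise the write side that the report singles out; on small test directories it is just a plain copy.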
> I lied a bit. I know how to crash 1.3.35 the same way:
>
> a. cpio to tape with blocking of 1MB.
> b. do the cpio -dmp from one RAID-[01] DPT partition to another on a P5-90.
>
> These are NOT bugs in the eata_dma driver, nor the BusLogic driver (unless
> they are both bad exactly the same way - hard to swallow).
> The bus reset comes from a layer above the HBA. Different HBAs react
> differently but the result is the same:
>
> FAST SCSI I/O ON LINUX IS IMPOSSIBLE WITHOUT CRASHES.
> LARGE BLOCK I/O ON LINUX IS IMPOSSIBLE WITHOUT CRASHES
>
> We have had this problem for a long time. I posted it several times. I can
> repeat and reproduce it any time.
>
> The problem cannot be reproduced by random seeking, dd or any other trivial
> (but useless) method. It can only be reproduced by doing the type of fast
> copy I am describing above. A clue can probably be found in the fact that
> backup to tape never crashes, rs (which does random reads) never crashes,
> and dd to /dev/null never crashes. It always crashes on the WRITE side,
> seemingly to the same drive.
>
> The error is always an infinite loop of
> ``SCSI: resetting host scsi[01] due to target n''
>
> It is always the result of a ``timeout''. There is no way to kill it:
> sync never returns, umount never returns, df never returns. Therefore
> shutdown never completes.
>
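To put a number on the reset loop, one could count the reset messages per host in a saved copy of the kernel log. A minimal sketch, assuming the log lines carry the wording exactly as quoted above; the name count_scsi_resets is hypothetical, and the real message text may differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): count SCSI host
# reset messages per host in the log files given as arguments, or in
# stdin when no file is given.
# Assumes the wording "SCSI: resetting host scsiN due to target M".
count_scsi_resets() {
    # Keep only the host name (scsi0, scsi1, ...) from matching lines,
    # then tally the occurrences per host and sort by host name.
    sed -n 's/.*SCSI: resetting host \(scsi[0-9][0-9]*\) due to target.*/\1/p' "$@" |
        awk '{ n[$1]++ } END { for (h in n) print h, n[h] }' |
        sort
}
```

Each output line is a host name followed by its reset count. Since sync is reported never to return once the loop starts, the log would probably have to be captured somewhere other than the affected disks.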
> These symptoms are consistent for ANY SCSI error: the endless loop, the death
> of sync, etc., even when the hardware has a real problem (mainly disks).
>
> SCSI tape failures typically just leave the process hung and abort.
> At times, the process will keep a disk file open and refuse to die,
> but the I/O subsystem is still alive and not a death trap.
>
> I think it would be nice if we could fix it somehow. I do not have the time
> to see my family, but will try and help as much as I can. I just do not know
> the SCSI code nearly well enough.
>
>
> Sincerely Yours,
> (Sent on 03/06/96, 00:01:23)
> Simon Shapiro i-Connect.Net, a Division of iConnect Corp.
> Shimon@i-Connect.Net 13455 SW Allen Blvd., Suite 140 Beaverton OR 97008
>
>
