Re: SCSI Kernel Problem - BAD

Simon Shapiro (Shimon@i-connect.net)
Mon, 11 Mar 1996 16:43:01 -0600 (CST)


Hi "Leonard N. Zubkoff"; On 07-Mar-96 you wrote:
> > Date: Wed, 06 Mar 1996 00:39:28 -0600 (CST)
> From: Simon Shapiro <Shimon@i-connect.net>
>
> I am posting to these two groups as the solution is somewhere in between :-)
> This problem has been going on for a long time. I sort of learned to live
> with it as no one seemed to be able/interested in helping me solve it.
>
> 1. Stage: 1.3.71, 64MB RAM, eata_dma and/or Buslogic, etc.
> 2. Action: cd /some_big_filesystem;
> find . -print | cpio -dmpv /another_big_partition
> You can replace this with tar | tar, etc.
> 3. Result: SCSI bus reset in a loop due to timeout on one
> disk or another (typically the same disk).
>
> How reproducible is this supposed to be? I built a vanilla 1.3.71 kernel,
> setup one 2.1GB disk as /x and another as /y, populated /x with 377MB of X11
> source files, and then executed the above find | cpio (without the -v) 10 times
> in a row copying from /x to /y and then deleting the files on /y. No problems
> whatsoever.
>

Yes, but:

1. How many devices on the SCSI chain?
2. How many SCSI chains?
3. Narrow or wide drives?
4. How many disks participate in the transfer (any RAID?)

My feeling is that these things do not happen under normal load conditions (i.e.
the CPU queues more I/O ops than the device is capable of and is throttling).
They happen when the I/O subsystem is very fast AND the CPU has many requests.

For example, it happens on the DPT controller when the source/target is a RAID-0
array and the target is a FAST tape drive AND the blocking factor is such that
the tape streams constantly.

It happens on the BusLogic when the source or destination is a RAID array
or a FAST tape. A single disk to single disk copy simply does not generate enough
I/O ops per second. (A GOOD disk will do 80-100 IOPS. RAID them and observe
200-500 IOPS on the same SCSI bus. Now it will crash.)
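
As a rough sketch of that arithmetic: the 80-100 per-disk figure is the one quoted above, while the linear scaling across stripe members and the function name are assumptions made purely for illustration. Real arrays will scale less cleanly.

```python
def aggregate_iops(per_disk_iops, n_disks):
    """Idealized request rate of a stripe set, assuming every member
    disk contributes its full per-disk rate (an optimistic assumption)."""
    return per_disk_iops * n_disks


# One GOOD disk versus small stripe sets sharing a single SCSI bus.
for n in (1, 2, 5):
    low = aggregate_iops(80, n)
    high = aggregate_iops(100, n)
    print(f"{n} disk(s): {low}-{high} IOPS")
```

Under that assumption, a handful of striped disks is enough to reach the 200-500 range where the resets start.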

> These are NOT bugs in the eata_dma driver, nor the BusLogic driver (unless
> they are both bad exactly the same way - hard to swallow).
> The bus reset comes from a layer above the HBA. Different HBA's react
> differently but the result is the same:
>
> FAST SCSI I/O ON LINUX IS IMPOSSIBLE WITHOUT CRASHES.
> LARGE BLOCK I/O ON LINUX IS IMPOSSIBLE WITHOUT CRASHES
>
> Well, actually I believe there are presently problems with abort and reset
> handling that are inherent in the current interfaces, and so there is no way
> the drivers can work correctly. I described one of these problems last week
> but I haven't seen any discussion so perhaps that message was swallowed
> somewhere. More on these issues in a subsequent message.
>
> We had this problem for a long time. I posted it several times. I can
> repeat and reproduce it any time.
>
> Do you have any idea if 1.2.13 suffers from this problem as well, or is it only
> the 1.3.x kernels that do? Can you give me better guidance on setting up a
> test environment to reproduce this?

If my memory serves me, 1.2 kernels were less sensitive. 1.2.13, I think, is also
broken (only repeatable with the DPT (eata_dma)).
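
For anyone setting up a test, the copy-then-delete loop can be sketched as below. This is only a load driver, not a guaranteed reproducer: it uses Python's shutil in place of find | cpio, the function name and the "copy_under_test" directory name are made up for illustration, and it assumes the source or destination sits on a fast device (RAID array), since single disk to single disk has not been enough to trigger the resets.

```python
import shutil
import tempfile
from pathlib import Path


def copy_and_delete(src, dst_parent, passes=10):
    """Copy the tree at src under dst_parent, delete the copy, and repeat.

    Returns the total number of regular files copied across all passes.
    """
    src = Path(src)
    # Hypothetical scratch directory name; it must not already exist.
    target = Path(dst_parent) / "copy_under_test"
    files_copied = 0
    for _ in range(passes):
        shutil.copytree(src, target)   # the write-heavy side of the test
        files_copied += sum(1 for p in target.rglob("*") if p.is_file())
        shutil.rmtree(target)          # clean up before the next pass
    return files_copied
```

It would be called with the mount points of the two filesystems, for example copy_and_delete("/x", "/y", passes=10), using the /x and /y names from the quoted test.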

>
> The problem cannot be reproduced by random seeking, dd or any other trivial
> (but useless) method. It can only be reproduced by doing the type of fast
> copy I am describing above. A clue can be probably found in the fact that
> backup to tape never crashes, rs (which does random read) never crashes,
> dd to /dev/null never crashes. It always crashes on the WRITE side,
> seemingly to the same drive.
>
> The error is always an infinite loop of
> ``SCSI: resetting host scsi[01] due to target n''
>
> Does it look something like this:
>
> <6>SCSI host 0 abort (pid 395210) timed out - resetting
> <6>scsi0: Resetting BusLogic BT-958 due to Target 0
> <6>scsi0: *** BusLogic BT-958 Initialized Successfully ***
> <6>SCSI host 0 abort (pid 395560) timed out - resetting
> <6>scsi0: Resetting BusLogic BT-958 due to Target 0
> <6>scsi0: *** BusLogic BT-958 Initialized Successfully ***
> <6>SCSI host 0 abort (pid 395210) timed out - resetting
> <6>scsi0: Resetting BusLogic BT-958 due to Target 0
> <6>scsi0: *** BusLogic BT-958 Initialized Successfully ***
> <6>SCSI host 0 abort (pid 395560) timed out - resetting
> <6>scsi0: Resetting BusLogic BT-958 due to Target 0
> <6>scsi0: *** BusLogic BT-958 Initialized Successfully ***
>
> but with no recovery actually occurring? Unfortunately, I cannot generate
> these problems on demand as you appear to be able to.

Yes, except that it is scsi1 and target 4 :-)
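
To tell a reset loop from a one-off reset, the kernel log can be tallied per host and target. A minimal sketch, assuming the BusLogic message format shown in the quoted log (other drivers, eata_dma included, word the message differently and would need their own pattern); the function name is hypothetical:

```python
import re
from collections import Counter

# Matches lines such as "<6>scsi0: Resetting BusLogic BT-958 due to Target 0".
RESET_LINE = re.compile(r"scsi(\d+): Resetting .* due to Target (\d+)")


def count_resets(log_lines):
    """Count reset messages per (host, target) pair."""
    counts = Counter()
    for line in log_lines:
        match = RESET_LINE.search(line)
        if match:
            host, target = int(match.group(1)), int(match.group(2))
            counts[(host, target)] += 1
    return counts


sample = [
    "<6>SCSI host 0 abort (pid 395210) timed out - resetting",
    "<6>scsi0: Resetting BusLogic BT-958 due to Target 0",
    "<6>scsi0: *** BusLogic BT-958 Initialized Successfully ***",
    "<6>SCSI host 0 abort (pid 395560) timed out - resetting",
    "<6>scsi0: Resetting BusLogic BT-958 due to Target 0",
]
print(count_resets(sample))
```

A count that keeps climbing for the same pair, with the same two pids alternating in the abort lines, is the loop described here.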

>
> It is always as a result of a ``timeout''. There is no way to kill it,
> sync never returns, umount never returns, df never returns. Therefore
> shutdown never completes.
>
> The errors I've been looking into all start with timeouts as well, but that's
> most likely because the abort/reset error recovery mechanisms are where the
> problems are, and those are only exercised when timeouts occur (well, resets
> also happen when commands fail).
>
> Leonard

As I already said, you are right. I disabled the reset calls and the system
does not crash. I get tons of complaints (I enabled DEBUG_TIMEOUT in scsi.c),
but no crashes.

Sincerely Yours,
(Sent on 03/11/96, 16:43:01)
Simon Shapiro i-Connect.Net, a Division of iConnect Corp.
Shimon@i-Connect.Net 13455 SW Allen Blvd., Suite 140 Beaverton OR 97008