Re: SCSI Kernel Problem - BAD

Simon Shapiro (Shimon@i-connect.net)
Tue, 12 Mar 1996 13:58:34 -0600 (CST)


Hi "Eric Youngdale"; On 12-Mar-96 you wrote:
> >
> >As I already said, you are right. I disabled the reset calls and the
> >system does not crash. I get tons of complaints (I enabled
> >DEBUG_TIMEOUT in scsi.c), but no crashes.
>
> This leads me to wonder whether interrupt latency is getting to be
> horribly long for some reason. It sounds like the requests are eventually
> completing, but just taking lots longer than we expected.
>
> A couple of thoughts - the timeout mechanism was never designed
> with tagged queueing in mind. The clock starts when the request is first
> passed down to the low-level driver, and the clock stops when the driver
> reports that the request is done. Thus if the disk gets lots of requests
> piled into it for some reason, and they are all large, the one at the end
> will have to wait quite a bit before it gets processed.

Perfect! This is at least part of it. Leonard was nice enough to print a
message to the console when the driver enables tagged queueing on a drive.
Soon after that message appears - poof!

>
> I haven't been paying close attention - is it only the NCR and
> Adaptec 2xxx series drivers that are showing this problem?
>

No. Also the BusLogic driver on the 956CD (wide differential), and the
eata_dma driver on both a PM3224 (PCI RAID/caching controller with 16MB
cache) and a PM3224W (wide SCSI with 4MB cache). Different setups trigger
the problem in different patterns, but it is reliable as heck.

The BusLogic setup:
  Older HP 2GB drives, MD RAID-0 across three drives, 64K stripes:
    Intense I/O triggers the timeout debug message every 2-3 minutes.
    Backup to tape (on the DPT) triggers timeouts every 10-15 seconds.
  New HP 2GB drives:
    Fine on read operations; immediate disaster on any WRITE operation.

DPT:
  Two drives on internal RAID-1, three drives on RAID-0; stripe size is 64K
  in both. The arrays appear to Linux as ``weird'' disks (Linux cannot see
  the actual drives at all):
    Any tape activity triggers an immediate reaction.
    find . -print | cpio -dmpv another_dir across the RAID arrays is a
    guaranteed PANIC - instant lockup, no logs, nothing.

As I mentioned before, the problem is more sensitive to the transaction
rate than to the queue depth. But I could be seeing a symptom rather than
the cause.

Sincerely Yours,
(Sent on 03/12/96, 13:58:34)
Simon Shapiro i-Connect.Net, a Division of iConnect Corp.
Shimon@i-Connect.Net 13455 SW Allen Blvd., Suite 140 Beaverton OR 97008