Re: 2.2.1 lockup (SCSI related)

Guest section DW (dwguest@win.tue.nl)
Wed, 3 Feb 1999 22:41:39 +0100 (MET)


From: Manfred Petz <pm@radawana.cg.tuwien.ac.at>

> Hmm - yes, my memory is bad but I recall encountering this
> and finding out where it was caused. If it was in aha1542.c
> then I probably fixed it. If it was in scsi_error.c or so
> then I didnt fix it.
>
> Anyway, could you try and see whether my aha1542.c fares any better?
> These days it is no longer available on ftp.win.tue.nl
> but ftp.cwi.nl/pub/aeb may work.

Ok. Here's what I get:

On startup:

scsi0 : Adaptec 1542
scsi : 1 host.
Vendor: QUANTUM Model: FIREBALL1080S Rev: 1Q09
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
aha1542.c: I/O error mbo=2 tarstat=2 hastat=0 idlun=1
aha1542.c: I/O error mbo=3 tarstat=2 hastat=0 idlun=2
...

A bit noisy, but this is probing for nonexistent LUNs.

Vendor: SEAGATE Model: ST3283N Rev: 9303
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sdb at scsi0, channel 0, id 1, lun 0

aha1542.c: I/O error mbo=2 tarstat=0 hastat=11 idlun=40
aha1542.c: I/O error mbo=3 tarstat=0 hastat=11 idlun=60
aha1542.c: I/O error mbo=4 tarstat=0 hastat=11 idlun=80
aha1542.c: I/O error mbo=5 tarstat=0 hastat=11 idlun=a0

No device with SCSI ID 2,3,4,5 found.

aha1542.c: I/O error mbo=6 tarstat=2 hastat=0 idlun=c0
aha1542_intr_handle: sense: 70 00 06 00 00 00 00 12 00 00 00 00 00 00 00 00
extra data not valid Current error xx00:00: sense key Unit Attention

The tape unit asserts Unit Attention after the reset caused by the boot.

Vendor: EXABYTE Model: EXB-8200 Rev: 425A
Type: Sequential-Access ANSI SCSI revision: 01
Detected scsi tape st0 at scsi0, channel 0, id 6, lun 0

So, this is all precisely as it should be.

...and when accessing the drive:

I/O error mbo=4 tarstat=2 hastat=0 idlun=c8
aha1542_intr_handler: sense f0 00 20 00 00 00 20 12 00 00 00 00 00 00 00 00
ILI Current error xx09:00:sense key None
st0: Incorrect block size
*** surprising: LUN is busy, but got interrupt nevertheless

OK, the `Incorrect block size' is reasonable, original aha1542.c locks
up the same way when mt(1)'ing the correct blocksize; I didn't test this
with your version since it's late and I think it makes no difference.

With the 2.2.1 version of aha1542.c I don't get the `st0: Incorrect block size'
error, it locks up before. I don't know whether this is just a coincidence.

Ach. My fine improved driver behaves better and he asks whether this is
by coincidence :-)

Anyway, all this is precisely as expected, up to the starred line.
[ILI: Incorrect Length Indicator: Requested logical block length
does not match the logical block length of the data on the medium.
I think the 20 says that you asked for 32 blocks and got 0.]

As it says, the starred line surprises me, since one might get
the impression from the Adaptec docs that this cannot happen.
(The old driver ignores this, this driver also ignores this, except
that it prints "Surprise!".)

Lockup again in scsi_error_handler(), VC switching works.

Hm, so what does this tell us?

Concerning this lockup, we have been told that it is because
of the semaphore changes - I have not yet looked myself and have
no time today, maybe later this week. Quoting (from grant):

=========================================================================
> > Are there any known issues with semaphores that explain this ? I suppose
> > it is possible that I am stomping on the eh_wait semaphore somehow, but
> > that seems an unlikely explanation.
>
> Yes -- they are recursive now. That means that once down() returns zero
> for you, it will return with zero immediately on subsequent calls because
> you 'own' the semaphore.
>
> I've said before, and I'll say it again, that this doesn't make any sense
> for semaphores, even though it might be appropriate for a mutex.

This strikes me as absurd.

It's even more absurd that such a fundamental change in the semantics
of a core kernel primitive was made during the deep code freeze.

I can probably solve the problem in my own driver by reverting to
the "obsolete" scsi error handling, but I'm sure there are going to be
more than a few people with aha1542 adapters that will be pulling their
hair out when their system freezes after the SCSI target probe.

Is this a brown-paper-bag bug ?
=========================================================================

So, if this is correct we'll first have to fix the semaphore use
in scsi_error.c and may afterwards see whether all is well with
this tape. I would like to understand better under what circumstances
the Adaptec can return this `Busy' status.

Andries

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/