Re: 2.4.0-test4&6 scsi tape problem [not fixed :-/]

From: Kai Makisara (Kai.Makisara@metla.fi)
Date: Thu Aug 31 2000 - 09:54:19 EST


I suggest we move this discussion from linux-kernel to linux-scsi.

On Thu, 31 Aug 2000, G. Saraber wrote:

> "Richard B. Johnson" wrote:
> >
> > On Wed, 30 Aug 2000, G. Saraber wrote:
> >
> > > Thanks for the excellent guide on how to pinpoint the problem ...
> > >
> > > guess what :-) I decided before I send in another bugreport i'll upgrade
> > > to test7 so the developers dont have to dig through 'old' kernel
> > > versions ..
> > > anyway, the problem went away, they must have fixed it :-)
>
> ok,
> i wasn't able to reproduce the problem before, but now it's occurred
> again, this the second nightly backup since booting with 2.4.0-test7,
> the first one went fine which i couldnt get to happen with test4 and
> test6 so test7 is slightly better. The error occured after a little over
> 3GB is backed up on a 12GB (uncompressed space) tape.
> However this time I have logs :-) and once again, "mt offline" or the
> button on the tapedrive itself won't release the tape,
> any mt command after the first "mt offline" gives:
> [root@ahr log]# mt offline
> /dev/tape: Input/output error
>
> right away, it doesnt even try to access the drive the second time
> around..
>
> (logs attached below) i'll do more testing later today.
> i had to heavily snip the log to keep the size under control to save
> bandwith, i'll gladly send you the full list in the format of your
> choice, just drop me a line.
>
> Regards,
> Gerard Saraber
> gsaraber@rarcoa.com
> http://www.rarcoa.com
> ---------------- scsi log ------------------------------
> Aug 31 00:42:50 ahr kernel: scsi : aborting command due to timeout : pid
> 0, scsi0, channel 0, id 3, lun 0 0x2a 00 00 09 b4 d5 00 00 02 00
> Aug 31 00:43:15 ahr kernel: scsi : aborting command due to timeout : pid
> 0, scsi0, channel 0, id 3, lun 0 0x28 00 00 02 17 3f 00 00 80 00
> Aug 31 00:43:20 ahr kernel: scsi : aborting command due to timeout : pid
> 0, scsi0, channel 0, id 3, lun 0 0x2a 00 00 09 b4 d5 00 00 02 00
> Aug 31 00:43:20 ahr kernel: scsi : aborting command due to timeout : pid
> 0, scsi0, channel 0, id 4, lun 0 0x2a 00 00 26 74 f3 00 00 02 00
> Aug 31 00:43:20 ahr kernel: scsi : aborting command due to timeout : pid
> 0, scsi0, channel 0, id 4, lun 0 0x2a 00 00 26 77 1b 00 00 22 00

You get timeouts from devices 3 and 4 on the bus (I assume these are
disks). These are read (0x28) and write (0x2a) commands (10 byte
versions). Note that these timeouts start at 00:42:50.

The disks probably time out because the tape drive has not released the
SCSI bus.

[timeout messages cut]

> Aug 31 00:57:20 ahr kernel: scsi : aborting command due to timeout : pid
> 0, scsi0, channel 0, id 3, lun 0 0x28 00 00 01 74 57 00 00 68 00
> Aug 31 00:57:20 ahr kernel: scsi : aborting command due to timeout : pid
> 0, scsi0, channel 0, id 5, lun 0 0x0a 01 00 00 40 00
                                   ^^^^^^^^^^^^^^^^^^^
This is presumably the last command sent to the tape drive. It is a write
command that writes 64 tape blocks in fixed block mode. This looks legal.

The time here is 00:57:20 which means that the tape command times out 15
minutes after the first timeout from the disks. The tape driver timeout is
15 minutes and so the tape command times out properly.

After this the SCSI subsystem decides that it should try to reset the SCSI
bus to resolve the problem.

> Aug 31 00:57:20 ahr kernel: SCSI host 0 abort (pid 0) timed out -
> resetting
> Aug 31 00:57:20 ahr kernel: SCSI bus is being reset for host 0 channel
> 0.
> Aug 31 00:57:20 ahr kernel: (scsi0:0:5:0) Synchronous at 10.0 Mbyte/sec,
> offset 15.
> Aug 31 00:57:20 ahr kernel: (scsi0:0:3:0) Synchronous at 80.0 Mbyte/sec,
> offset 31.
> Aug 31 00:57:20 ahr kernel: (scsi0:0:4:0) Synchronous at 80.0 Mbyte/sec,
> offset 31.
> Aug 31 00:57:25 ahr kernel: st0: Error with sense data: Info fld=0x40,
> Current st09:00: sns = f0 6
                             ^
translation: UNIT ATTENTION
(I am not trying to shout: the capital letters are cut from the SCSI
standard draft :-)
(If you enable verbose SCSI messages in the kernel configuration, the
kernel does this translation for you.)

> Aug 31 00:57:25 ahr kernel: ASC=29 ASCQ= 0
                              ^^^^^^^^^^^^^^
translation: POWER ON, RESET, OR BUS DEVICE RESET OCCURRED

> Aug 31 00:57:25 ahr kernel: Raw sense data:0xf0 0x00 0x06 0x00 0x00 0x00
> 0x40 0x12 0x00 0x00 0x00 0x00 0x29 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x00 0x00 0x00 0x00 0x00 0x00
> Aug 31 00:57:25 ahr kernel: st0: Error with sense data: Info fld=0x1,
> Current st09:00: sns = f0 2
                             ^
translation: NOT READY

> Aug 31 00:57:25 ahr kernel: ASC= 4 ASCQ= 1
                              ^^^^^^^^^^^^^^
translation: LOGICAL UNIT IS IN PROCESS OF BECOMING READY

When the tape driver sees a unit attention anywhere else than at open(),
it prevents further access to the tape until some command is issued that
puts the tape into a known position. Rewind is one example. So, the fact
that you don't seem to be able to do anything with the tape after the bus
reset may not a bug. (MTOFFL should be allowed but there may be something
I overlook.)

The real problem is why the tape drive does not release the SCSI bus when
it has been sent a command that from the log output looks legal. Usually
the tape drives and SCSI adapters are configured so that if any command
takes a long time, the SCSI bus is released when the data has been
transferred. I assume that SCSI disconnects are enabled in your
configuration?

The most common reasons for a SCSI device hanging are:
- problems with the SCSI cables and/or termination
  - your drive works with earlier kernels; it is possible (maybe not
    probable) that the new kernel shows problems if the cables/termination
    are marginal
- the power supply is not large enough

It is also possible that some change in the kernel has created a bug. The
changes in the tape driver after 2.2 have been such that they should not
affect normal writing (barring side effects from the changes). Besides
changes in the tape driver, there have been changes in the middle level
SCSI code and the aic7xx host driver. Other changes in the kernel have
also changed the environment where these drivers work.

I am sorry that I can't provide any direct solution for your problem at
this time. Above I have only deciphered from the logs what has happened,
not why. This needs more thinking...

        Kai

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Aug 31 2000 - 21:00:27 EST