Re: Kernel Panic (aha1542.c)

Guest section DW (dwguest@win.tue.nl)
Sun, 27 Dec 1998 22:13:47 +0100 (MET)


From: Peter Rasmussen <plr@interlog.com>
Subject: Kernel Panic (aha1542.c)

During a badblocks scan of a Seagate 1239NS (200MB) SCSI harddisk sitting on
an AHA1542B adapter I got a number of these:

aha1542.c: Trying device reset for target 0

And then finally this:

Kernel panic: unable to find empty mailbox for aha1542.

As the disk is getting old it may have had a number of bad blocks on it, but
I would prefer if I didn't have to reset the whole system just because of that.

Yes. Roughly speaking this is the fault of scsi_error.c,
which insists on either recovering from an error or
crashing the system. If you read some linux-kernel backlog
you'll notice lots of people with a bad CD or bad tape where
scsi_error.c decided to crash an otherwise perfectly functional system.

The code goes like:
  scsi_error_handler() calls scsi_unjam_host()
  [already the name is misleading - nothing is wrong with the host (adapter)]
  scsi_unjam_host() finds a command that timed out, and then does
    scsi_try_to_abort_command() - which fails; then
    scsi_try_bus_device_reset() - which fails; then
    scsi_try_bus_reset() - which fails; then
    scsi_try_host_reset() - which fails; then
    device->online = FALSE;
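
For readers who prefer code to prose, the escalation described above can be
modelled as a small user-space sketch. This is not the kernel source: the
struct, the function-pointer type and the two stub steps are hypothetical
names invented for illustration. Only the order of the steps and the final
"take the device offline" come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of one recovery step (illustrative names, not the kernel's). */
enum step_result { STEP_FAILED = 0, STEP_OK = 1 };

/* Hypothetical stand-in for the per-device state. */
struct sim_device {
    int online;       /* plays the role of device->online */
    int steps_tried;  /* how many steps of the ladder were attempted */
};

typedef enum step_result (*recovery_step)(struct sim_device *);

/*
 * Try each step in order (abort, device reset, bus reset, host reset).
 * Stop at the first step that succeeds. If every step fails, mark the
 * device offline. Returns 1 if the device was recovered, 0 otherwise.
 */
static int run_recovery_ladder(struct sim_device *dev,
                               const recovery_step *steps, size_t nsteps)
{
    assert(dev != NULL);
    for (size_t i = 0; i < nsteps; i++) {
        dev->steps_tried++;
        if (steps[i](dev) == STEP_OK)
            return 1;
    }
    dev->online = 0;
    return 0;
}

/* Stub steps for experimenting; a real driver would talk to hardware. */
static enum step_result always_fails(struct sim_device *dev)
{
    (void)dev;
    return STEP_FAILED;
}

static enum step_result always_works(struct sim_device *dev)
{
    (void)dev;
    return STEP_OK;
}
```

Passing four always_fails steps reproduces the case described above: all four
are tried and the device ends up offline.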

This especially hits aha1542 users, since the aha1542 driver is rotten
(it is fine as long as no aborts or resets are tried, but worse than worthless
as soon as aborts/resets occur), so that the scsi_error routines cannot
trust the driver responses.

You can improve the stability of your system quite a lot
by #ifdef'ing out the scsi_try_bus_reset() and scsi_try_host_reset() parts
of the above. A device error is not an indication that the host adapter is
malfunctioning and the SCSI bus needs resetting.
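
The shape of such a change might look roughly like the sketch below. It is a
user-space model, not a patch against scsi_error.c: the macro
TRY_BUS_AND_HOST_RESET, the struct and the try_rung() stub are all
hypothetical names made up for this example. With the macro left undefined,
only the abort and the device reset are attempted before the device is
taken offline.

```c
#include <assert.h>

/* Hypothetical switch, not a real kernel option. Uncomment to restore
 * the bus reset and host reset steps. */
/* #define TRY_BUS_AND_HOST_RESET */

enum {
    RUNG_ABORT        = 1,
    RUNG_DEVICE_RESET = 2,
    RUNG_BUS_RESET    = 4,
    RUNG_HOST_RESET   = 8
};

/* Hypothetical stand-in for the per-device state. */
struct fake_dev {
    int online;            /* plays the role of device->online */
    unsigned rungs_tried;  /* bitmask of the steps that were attempted */
};

/* Stub: records the attempt and reports failure (returns 0). */
static int try_rung(struct fake_dev *dev, unsigned rung)
{
    dev->rungs_tried |= rung;
    return 0;
}

/* Returns 1 if some step recovered the device, 0 if it went offline. */
static int unjam_device_only(struct fake_dev *dev)
{
    assert(dev != 0);
    if (try_rung(dev, RUNG_ABORT))
        return 1;
    if (try_rung(dev, RUNG_DEVICE_RESET))
        return 1;
#ifdef TRY_BUS_AND_HOST_RESET
    /* These two steps disturb every device on the bus, not just the bad one. */
    if (try_rung(dev, RUNG_BUS_RESET))
        return 1;
    if (try_rung(dev, RUNG_HOST_RESET))
        return 1;
#endif
    dev->online = 0;
    return 0;
}
```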

You can also improve the stability of your system a lot
by rewriting aha1542.c. I did this last August for a friend,
and he is happy now, but I need to experiment a bit more
to find out what this Adaptec precisely does upon an abort,
before this new driver can be released to the public.
[In fact I just started looking at this again this evening.]
Unfortunately, Adaptec itself seems to be very unhelpful.

Looking through the code in aha1542.c I found many comments about trouble with
resetting the bus, devices and the host, so it looks like an area that might
need some attention :-)

As I have a[n older] version of the adapter and what looks like a good testbed
for problems (an old disk with bad blocks) I would like to look into this. I do
not, however, have the jumper-settings sheet anymore (lost between 1992 and now)
so I am not really aware of the settings. If someone could let me know them I
would appreciate it. Any other information about the adapter would be useful
as well, so if you have that or pointers to where I can get it, please let me
know.

There is some information. Eric Youngdale provided a file 1540USR.MAN
that is also available at http://ftp.digital.com/pub/micro/adaptec/ and
http://ftp.dei.uc.pt/pub/adaptec/ and http://ftp.rad.kumc.edu/share/dos/adaptec/.
The firmware is also available (but I don't know what kind of micro it is for).

Any suggestions from any of you guys would probably be helpful, e.g. "try this"
or "don't do that". For a start, I would like to know what all this talk of
mailboxes is about, as it is all over the code in aha1542.c?

Mailboxes are the means of communication between you and the AHA1540.
Read 1540USR.MAN.
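
As a rough illustration of the idea, here is a simplified user-space model of
an outgoing mailbox array. The details are assumptions to be checked against
1540USR.MAN: a 4-byte slot holding one status byte plus a 24-bit CCB address
stored most significant byte first, the status values, and the count of 8
slots. The function names are hypothetical. The host claims a free slot,
stores the address of a command block in it and marks it "start"; the
adapter is expected to free the slot again. If no slot is ever freed, the
search comes up empty, which presumably is the situation behind the panic
message quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_MAILBOXES 8  /* assumed slot count, for illustration only */
#define MBO_FREE      0  /* assumed: slot is available to the host */
#define MBO_START     1  /* assumed: host asks the adapter to start a command */

/* Assumed slot layout: status byte plus 24-bit address of a CCB. */
struct mailbox {
    uint8_t status;
    uint8_t ccb_addr[3];
};

/*
 * Round-robin search that begins just after the slot used last time.
 * Returns the index of a free slot, or -1 if every slot is still busy.
 */
static int find_free_mailbox(const struct mailbox *mbo, size_t n,
                             size_t last_used)
{
    assert(mbo != NULL && n > 0);
    for (size_t i = 1; i <= n; i++) {
        size_t idx = (last_used + i) % n;
        if (mbo[idx].status == MBO_FREE)
            return (int)idx;
    }
    return -1;
}

/*
 * Claim a free slot and fill it in. Returns the slot index, or -1 when
 * no slot is free (the caller decides what to do then).
 */
static int post_command(struct mailbox *mbo, size_t n, size_t last_used,
                        uint32_t ccb_phys)
{
    int idx = find_free_mailbox(mbo, n, last_used);
    if (idx < 0)
        return -1;
    /* Assumed byte order: most significant byte first. */
    mbo[idx].ccb_addr[0] = (uint8_t)((ccb_phys >> 16) & 0xff);
    mbo[idx].ccb_addr[1] = (uint8_t)((ccb_phys >> 8) & 0xff);
    mbo[idx].ccb_addr[2] = (uint8_t)(ccb_phys & 0xff);
    mbo[idx].status = MBO_START;  /* written last: hands the slot over */
    return idx;
}
```

Returning -1 instead of panicking leaves the decision to the caller, which
could for example fail the one command rather than stop the whole machine.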

In case you want to discuss details, mail me - aeb@cwi.nl - do not reply.

Andries
