Re: Adaptec SCSI Driver fails during mirroring failover testing (2.2.15/2.3.99-pre6)

From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Thu Apr 27 2000 - 17:03:07 EST


Doug Ledford wrote:
>
> "Jeff V. Merkey" wrote:
>
> >
> The driver (and certainly not the mid level code either) doesn't do this. The
> only thing that will typically result in this type of behavior is when the act
> of removing the device caused an electrical/signal problem on the bus that is
> now putting the bus into an infinite wedge scenario where no commands to other
> devices can get through. I've also seen this in certain types of drive
> failures when the drive gets so confused that it essentially never releases the
> BUSY pin on the bus, even after repeated bus resets.

This may be what's happening. The "reseting SCSI bus" message loops
endlessly when this happens. It doesn't appear that the previous IO
requests are even being retried; the system just hangs in an endless
loop (10 minutes should be more than enough time for the SCSI bus to
recover itself).
>
> >
> If you implement your own timers and want to risk confusing the hell out of
> the mid layer SCSI code, then go right ahead and do this. Otherwise, you have
> to wait for the mid layer SCSI code to tell you that a command has timed out
> and then take appropriate action. This is by design (although one that many
> of us bitch about on occasion), not an omission from the aic7xxx driver.

Ok.

>
> No, it shouldn't. But, without the actual error messages or a repeatable case
> of this (since I don't have that problem here), there's not much I can do
> about it. Since you are using async I/O to do this mirror operation (or at
> least I thought you said you were), what's the retry limit on those commands?
> Are they retrying forever or when the aic7xxx driver returns an error to the
> upper layer is it getting flagged as such and the operation dropped?

I try the operation once, then if it fails, I drop into the hotfix code,
and attempt to recover the read from another active mirror. If I cannot
find the data on a mirror, I ask the LRU if it has the missing data in
memory (in 95% of IO error cases, data is present on another mirror or
in the cache). If the data cannot be located, I drop into a read retry
operation where I change the sector read order and cycle through each
block with a different interleave to see if I can re-read the failing
sectors. If I cannot get all the sectors, I build a bad_bit mask that
is written to the hotfix area, allocate a hotfixed block, and write the
recovered data and bad_bits (if present) into the table. I then
artificially hotfix all the active mirrors in that mirror group with
bad_bits (if present) just for the read I/O error case. When someone
writes to a hotfixed area, the bad_bits get cleared. If someone reads
from a read hotfixed area with bad_bits, an IO error gets propagated up
to the calling user if their read enters a region defined by the bad
bits (so they know they may have lost data). In other words, when the
IO request fails, it will generate several other IO requests during
recovery and failover. If the drive has gone completely bye-bye during
hotfix retry, I update the mirror group headers on each device, and tell
them that the device is no longer present.
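
As a rough sketch of the sequence above (this is not the actual code;
every name below is made up for illustration, and the layout of one bit
per sector with 8 sectors per block is an assumption, as is clearing
only the sectors that a write actually covers):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout. One bad bit per sector is an assumption. */
#define SECTORS_PER_BLOCK 8u

enum recover_src { SRC_MIRROR, SRC_CACHE, SRC_REREAD, SRC_PARTIAL };

struct hotfix_entry {
    uint32_t bad_bits;  /* bit i set => sector i could not be recovered */
};

/*
 * Choose the recovery source in the order described: another mirror,
 * then the LRU cache, then the re-read pass. good_sectors is the mask
 * of sectors the re-read pass managed to recover.
 */
static enum recover_src recover_block(int on_mirror, int in_cache,
                                      uint32_t good_sectors,
                                      struct hotfix_entry *hf)
{
    const uint32_t all = (1u << SECTORS_PER_BLOCK) - 1u;

    hf->bad_bits = 0;
    if (on_mirror)
        return SRC_MIRROR;
    if (in_cache)
        return SRC_CACHE;

    /* Whatever the re-read could not get back is marked bad. */
    hf->bad_bits = all & ~good_sectors;
    return hf->bad_bits ? SRC_PARTIAL : SRC_REREAD;
}

/* Mask covering sectors [first, first + count) within one block. */
static uint32_t sector_mask(unsigned first, unsigned count)
{
    assert(first + count <= SECTORS_PER_BLOCK);
    return ((1u << count) - 1u) << first;
}

/* A read fails only if it enters a region marked by the bad bits. */
static int hotfix_read(const struct hotfix_entry *hf,
                       unsigned first, unsigned count)
{
    return (hf->bad_bits & sector_mask(first, count)) ? -1 : 0;
}

/* A write supplies fresh data, so the covered bad bits are cleared. */
static void hotfix_write(struct hotfix_entry *hf,
                         unsigned first, unsigned count)
{
    hf->bad_bits &= ~sector_mask(first, count);
}
```

The point of the mask is that a read which stays clear of the bad
sectors still succeeds, and only a read that overlaps them gets the
error.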

>
> > NOTE: once you pull that device out, with the exception of any tagged commands
> that were active at the time, all future commands from the aic7xxx driver will
> get returned after the SELECTION_TIMEOUT has occurred. Those commands that
> were outstanding to the device will get returned after the first bus reset.
> Once they have been returned, the mid layer will requeue them, and this time
> they too should get a SELECTION_TIMEOUT.
>
> I need a duplicatable test case. I also need to know the nature of the SCSI
> bus at the time this all happened. I need to know if maybe the drive was
> mostly removed from its contacts but maybe had just enough of its edge
> connector still in contact that it was actually screwing the bus while just
> sitting there.

The drive was completely removed.

> Anything along these lines can help to track it down. Also,
> the original post talks about 2.2.15 and 2.3.99-pre, I need to know which this
> happened under (or if both), and you should probably update 2.2.15 with the
> latest aic7xxx driver which is on my web site.
>
>

It happened on both. I will download the latest this evening and
reproduce the problem. I will also send you the messages file so you
can see the obnoxious messages.

:-)

Jeff

--
> 
>  Doug Ledford <dledford@redhat.com>  http://people.redhat.com/dledford
>       Please check my web site for aic7xxx updates/answers before
>                       e-mailing me about problems
