Re: SCSI Kernel Problem - BAD

Leonard N. Zubkoff (lnz@dandelion.com)
Fri, 8 Mar 1996 09:50:51 -0800


From: "Eric Youngdale" <eric@aib.com>
Date: Fri, 8 Mar 1996 11:09:09 -0500

While you are right that there are problems with abort and reset,
I believe that they have always been there. In general, fault handling
should have been slowly improving as time has gone on (and as I get ideas
about how to handle some of these situations).

Indeed they have been slowly getting better.

I cannot think of anything that has changed at the scsi level
which could explain this sort of problem - I guess my inclination is to
wonder whether the new page cache stuff is related.

Agreed. I expect one of the reasons the abort/reset code has not been
completely debugged is that in normal operation it gets invoked so rarely. If
there is now a higher level bug leading to timeout problems, that would cause
the abort/reset code to be used much more than ever before. It's only recently
that I've come up with a way of reliably generating such timeouts.

One thing to keep in mind that may be related - I have discovered
that with some disk drives, attempts to access beyond the end of the disk
will lock the thing up. It is possible that with the new page cache
we are inadvertently requesting a sector that is beyond the end, and
this is leading to a lockup. One way to test this is to simply look at the
messages logged to the console and see what sectors we are trying to
access when the thing locks up.
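
As a rough illustration of that check, here is a minimal sketch in C of
comparing a logged request against the size of the disk. It is not taken from
the kernel's SCSI code; the function name and its parameters are hypothetical,
and the capacity is assumed to be the sector count the drive reported.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (an assumption for illustration, not a kernel API):
 * returns true when the request covering sectors [start, start + count)
 * lies entirely within a device of `capacity` sectors.
 *
 * The comparison is written as `count <= capacity - start` rather than
 * `start + count <= capacity` so that a very large start or count cannot
 * wrap around and appear to be in range.
 */
static bool request_in_range(unsigned long start, unsigned long count,
                             unsigned long capacity)
{
    if (start > capacity)
        return false;               /* starts past the end of the disk */
    return count <= capacity - start;
}
```

Feeding it the starting sector and length from the last request logged before
a lockup, together with the drive's reported capacity, would show whether that
request ran past the end of the disk.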

I'm still hoping that someone will describe a scenario that lets me reproduce
this problem on my test system.

In the meantime, has anyone seen the long message I posted on Wednesday night
regarding these issues? I'm beginning to wonder if linux-scsi is receiving my
mail.

Leonard