Re: Driver retries disk errors.

From: Erik Mouw
Date: Tue Aug 31 2004 - 12:03:25 EST


On Tue, Aug 31, 2004 at 04:13:54PM +0100, Alan Cox wrote:
> On Tue, 2004-08-31 at 16:56, Erik Mouw wrote:
> > The SCSI disk driver has been doing no retries for quite some time
> > and it hasn't really bitten people. Why would the IDE disk driver be
> > different? The only case in which I can imagine a retry being OK is
> > when we get a UDMA CRC error (caused by bad cables).
>
> Retries also pop up in other less obvious cases and conveniently paper
> over a wide variety of timeouts, power management quirks and drives just
> having a random fit. Eight is probably excessive in all cases.

There are indeed all sorts of other retries in various layers; the
worst case is when the kernel tries to read ahead a couple of blocks on
an IDE disk (1). You can work around them with raw I/O or O_DIRECT.

> For non hard disk cases many devices do want and need retry.

And many others do not. CompactFlash readers are usually implemented as
USB storage devices, which in turn are implemented as SCSI "disks". So
far I haven't seen a CompactFlash card that could be "fixed" by
retries.

iSCSI appliances can also make things worse. When the target machine is
implemented as a simple "pass everything to the real SCSI disk" device,
it's not really different from a directly connected SCSI disk. However,
when the iSCSI target interprets the SCSI commands and has its own way
of dealing with bad blocks (i.e., it retries the blocks itself), things
can get very bad when the kernel also uses a couple of retries.

In my experience it would be good if the IDE disk driver behaved like
the SCSI disk driver: no retries on a bad block. I agree that it would
be a good idea to make the retry count configurable through the block
layer to avoid code duplication (for example with something like
blockdev --getretries/--setretries).


Erik

(1) Imagine an application doing a linear read on a file with an
8-block read-ahead and the last block being bad. The kernel will try to
read that bad block 8 times, but because the IDE driver also does 8
retries, the bad block ends up being read 8 x 8 = *64* times. It
usually takes an IDE drive about 2 seconds to figure out that a block
is bad, so the application gets stuck for about 2 minutes (64 x 2 s =
128 s) on that single bad block.
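
For what it's worth, the arithmetic in this footnote can be sketched in
a few lines of Python. This is only a back-of-the-envelope model, not
anything taken from the kernel: the function name and its parameters
are made up for illustration, and the inputs (8 read-ahead attempts, 8
driver attempts, roughly 2 seconds per failed read) are the rough
figures from the footnote, not measurements. The point it shows is that
attempts in stacked layers multiply rather than add.

```python
from math import prod


def worst_case_stall(attempts_per_layer, seconds_per_failure):
    """Estimate the worst case for one bad block (illustrative model only).

    attempts_per_layer: how often each stacked layer issues the read,
        e.g. [read-ahead attempts, driver attempts].
    seconds_per_failure: rough time the drive needs to give up on a block.
    Returns (total_reads, stall_seconds).
    """
    # Every attempt at an upper layer triggers the full set of attempts
    # at the layer below, so the per-layer counts multiply.
    total_reads = prod(attempts_per_layer)
    return total_reads, total_reads * seconds_per_failure


# Figures from the footnote: 8 read-ahead attempts, 8 IDE driver
# attempts, about 2 seconds per failed read.
reads, stall = worst_case_stall([8, 8], 2.0)
print(f"{reads} reads, about {stall / 60:.1f} minutes stuck")
```

Adding a third entry to the list (say, a hypothetical iSCSI target that
also retries on its own) shows how quickly the stall grows once another
layer joins in.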

--
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands