Re: EXT2 and BadBlock updating.....

From: Theodore Y. Ts'o (tytso@MIT.EDU)
Date: Wed Apr 12 2000 - 00:46:33 EST


   Date: Tue, 11 Apr 2000 22:20:43 -0700 (PDT)
   From: Andre Hedrick <andre@linux-ide.org>

   On Thu, 6 Apr 2000, Alan Cox wrote:

> > > Multiwrite IDE breaks on a disk error
> >
> > Explain.........Please........
>
> If you have one bad sector you should write the other 7..

   Now if the ata/ide driver does not address this recovery, then I see
   big problems. Alan's case (my reading) states that regardless of
   whether we blow the write to a sector (in an 8-sector multi-write
   command), we should write all that we can........

That's definitely the case. If you're writing 8 sectors, and sector #4
has an error, the driver shouldn't give up and not write sectors 5-8!
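
In driver terms the fallback is simple: when the multiwrite faults, stop
trusting it and reissue the remaining sectors one at a time, noting which
ones fail.  A minimal sketch of the idea; write_one_sector() and
log_bad_sector() are invented names here, not real ide driver entry
points:

    /*
     * Illustrative only: fall back to single-sector writes after a
     * multiwrite error.  write_one_sector() and log_bad_sector() are
     * hypothetical helpers, not actual driver functions.
     */
    #include <linux/blkdev.h>
    #include <linux/errno.h>

    static int multiwrite_recover(struct request *rq, int nsect)
    {
            int i, bad = 0;

            for (i = 0; i < nsect; i++) {
                    if (write_one_sector(rq, i) == 0)
                            continue;               /* this sector made it */
                    bad++;
                    log_bad_sector(rq->rq_dev, rq->sector + i);
                    /* keep going: write everything we still can */
            }
            return bad ? -EIO : 0;
    }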

   Now why the "fork recovery"? We need to finish/complete the write
   request, but because we discovered a NEW BAD BLOCK/SECTOR we should
   walk the rest of the write, because there may be a section of the
   disk/track that is failing. Also, this fork would provide the means to
   log the location of the newly failed sector, go back and MARK it BAD,
   and issue a request to the FS to update the BADBLOCKS table. Thus we
   get:

               0|1|2|3| 4 |5|6|7|8 0|1|2|3| 4 |5|6|7|8
   Theodore -> w|w|w|w|FSR|w|w|w|w -> T|h|e|o|FSR|d|o|r|e -> Theodore

   FSR == FaultSeekRecover THREAD............
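
Taken literally, the bookkeeping half of that FSR thread is not much
code.  Every name below is invented and locking is omitted; this is only
meant to show the shape of the thing:

    /*
     * Sketch of the FSR bookkeeping: remember where the failure was so
     * a recovery thread can later mark the sector bad and ask the fs to
     * update its badblocks list.  All names invented; locking omitted.
     */
    #include <linux/slab.h>

    struct fsr_entry {
            kdev_t                  dev;    /* device the failure was on */
            unsigned long           sector; /* LBA of the failed sector  */
            struct fsr_entry        *next;
    };

    static struct fsr_entry *fsr_log;       /* consumed by the FSR thread */

    static void fsr_record(kdev_t dev, unsigned long sector)
    {
            struct fsr_entry *e = kmalloc(sizeof(*e), GFP_ATOMIC);

            if (!e)
                    return;                 /* best effort; drop the record */
            e->dev = dev;
            e->sector = sector;
            e->next = fsr_log;
            fsr_log = e;
            fsr_wakeup();                   /* hypothetical: kick the thread */
    }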

Some kind of method where the block device layer can notify the
filesystem of specifically which blocks went bad would be useful,
probably as a callback. Actually, it's already the case that the
filesystem can find out if there are problems assuming that the write is
being done synchronously. It's just that most of the time disk writes
are done as a "fire and forget", and by the time the disk notices
something has gone wrong, the context in which the write request was
queued is long gone.
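
For the record, the synchronous case that already works looks like this
with the current buffer-cache interface; the filesystem gets to see the
error precisely because it hangs around for it:

    /*
     * Sketch: a filesystem that waits on the buffer sees the error;
     * the usual fire-and-forget path never checks.  This uses the
     * existing ll_rw_block()/wait_on_buffer() interface.
     */
    #include <linux/kernel.h>
    #include <linux/fs.h>
    #include <linux/locks.h>

    static int write_block_sync(struct buffer_head *bh)
    {
            /* bh is assumed mapped and dirty */
            ll_rw_block(WRITE, 1, &bh);     /* queue it to the driver       */
            wait_on_buffer(bh);             /* sleep until the disk answers */
            if (buffer_uptodate(bh))
                    return 0;
            printk(KERN_ERR "write error on block %lu\n", bh->b_blocknr);
            return -EIO;
    }

When bdflush does the write minutes later instead, there is simply no one
left around to call.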

This is a Linux 2.5 issue, though --- it's not something we're going to
do for 2.4, and before we start this, we probably want to rototill the
entire block device layer anyway, since it's a bit of a mess and a kludge
right now.

As far as what the filesystem can do, in *some* cases it may be able to
put the block onto the bad block list, and then retry the write, but in other
cases, where the filesystem was just reading from a pre-existing file,
there really isn't much the filesystem can do that's sane. It could
unlink the block from the inode, and leave a hole there, and then relink
the block to the bad-block inode, but the applications above it still
probably won't deal well with that happening. And if the bad block is
discovered in the inode table, or some other part of the filesystem
metadata, there *really* isn't much that can be done from the kernel
level to recover from that.
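
If someone wanted to prototype that "leave a hole" case anyway, it would
come down to something like the following.  Both helpers are made up
(ext2 has no such routines today); only EXT2_BAD_INO, the reserved
badblocks inode, is real:

    /*
     * Hypothetical: retire a failing data block by punching a hole in
     * the file and parking the block on the badblocks inode.  The two
     * ext2_* helpers below do not exist; this is only a sketch.
     */
    #include <linux/fs.h>
    #include <linux/ext2_fs.h>

    static int ext2_retire_block(struct inode *inode, unsigned long blk)
    {
            struct inode *bad = iget(inode->i_sb, EXT2_BAD_INO);

            if (!bad)
                    return -EIO;
            ext2_unmap_block(inode, blk);   /* hypothetical: leaves a hole  */
            ext2_append_block(bad, blk);    /* hypothetical: pins the block */
            iput(bad);
            return 0;       /* the data in that block is still gone, though */
    }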

What would be *really* useful would be if S.M.A.R.T., or some other
facility could inform the filesystem that a block was *about* to fail.
In that case, the block could get relocated before data got lost, and
that would certainly be worth doing. There are still some cases where
if the bad block were to happen inside critical filesystem metadata, the
recovery would be so complex that you really wouldn't want to do it
inside the kernel, but things probably still could be made better. All
of this is not something that's going to happen before 2.4 ships,
however!
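
To give a flavor of the S.M.A.R.T. side: a user-space daemon can already
pull the drive's attribute table through the IDE ioctl interface and
complain before things die.  A rough sketch (attribute parsing and error
reporting left out, and how much of the SMART register setup the driver
does for you varies by version):

    /*
     * Userspace sketch: read the SMART attribute data (ATA command
     * 0xB0, feature 0xD0) via HDIO_DRIVE_CMD.  buf must be 4 + 512
     * bytes; the attribute table comes back at buf + 4.  Parsing it,
     * and deciding what "about to fail" means, is the real work.
     */
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/hdreg.h>

    int read_smart_attrs(const char *dev, unsigned char buf[4 + 512])
    {
            int fd = open(dev, O_RDONLY);

            if (fd < 0)
                    return -1;
            buf[0] = 0xb0;          /* WIN_SMART: the SMART command */
            buf[1] = 0;             /* sector number register       */
            buf[2] = 0xd0;          /* feature: SMART READ DATA     */
            buf[3] = 1;             /* expect one 512-byte sector   */
            if (ioctl(fd, HDIO_DRIVE_CMD, buf)) {
                    close(fd);
                    return -1;
            }
            close(fd);
            return 0;
    }

Run that against /dev/hda every few minutes and you at least get the
warning; wiring the result back into the filesystem is the 2.5 part.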

And it still doesn't change my contention that someone who wants
ultrareliability, and what you call "Enterprise class" computing,
without doing RAID, is fundamentally insane. There are things we can do
to try to recover in the face of broken hardware --- but fundamentally,
cheap sh*t hardware is still cheap sh*t hardware. You don't make
Enterprise class computers out of cheap sh*t. It just doesn't happen.

                                                        - Ted
