Re: EXT2 and BadBlock updating.....

From: Theodore Y. Ts'o (tytso@MIT.EDU)
Date: Wed Apr 12 2000 - 10:57:23 EST


   Date: Wed, 12 Apr 2000 09:38:12 -0500
   From: Ed Carp <erc@pobox.com>

   My experience has been exactly the opposite. I've got systems that
   have been running on drives with bad sectors for, literally, years.
   One drive that has periodic bad sectors on it has been running 24x7
   for over 3 years.

There are some cases where bad blocks appear but don't seem to augur a
chain reaction, sure. That's why we have the bad-blocks inode, and why
we have "e2fsck -c". On the other hand, usually once bad blocks start
appearing, it really is the beginning of the end. Stable bad blocks
found when you mke2fs the filesystem --- sure. Stable bad blocks that
appear while the filesystem is in service --- very, very rare. It's
like cancer, except well over 80% of the time it's malignant.
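For the stable-bad-block case, the usual procedure looks something like
this (the device name is only an example, and the filesystem must be
unmounted while you run these):

```shell
# Have e2fsck invoke badblocks itself (non-destructive read-only test)
# and record anything it finds in the filesystem's bad-block inode:
e2fsck -c /dev/sdb1

# Equivalently, run badblocks by hand (-s shows progress, -v is
# verbose, -o writes the list to a file) and feed the result back
# to e2fsck with -l:
badblocks -sv -o /tmp/badblocks.list /dev/sdb1
e2fsck -l /tmp/badblocks.list /dev/sdb1
```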

The other thing to ask is what is the economic value of the data on the
disk? And how much would it cost your client to have you recover the
disk if it were to catastrophically fail with no warning? If either of
the answers is more than $300 or so, then I'd probably replace any drive
that *has* to be running and is more than 2 years old. Disk drives
are cheap; the data on them, and unscheduled downtime, usually isn't.
If you're running production systems, you really don't want to screw
around.

(More than once while I was working for MIT, I saw this phenomenon
happen. We would install a large number of disks for our fileservers,
all from the same production run, all put under the same load at the
same time. Usually some amount of time later --- 18 to 24 months,
typically --- all of the disks from that batch would start failing
within a few weeks of one another. It was
eerie to watch: every day or two, boom, another disk would die --- or
when we were using more modern drives, start registering soft errors,
which is a hint that a disk is about to go. Usually after the first
few, people would take the hint and start scheduling mass replacements
of all of the drives, before they died and took their data with them.
The lesson here is that disks *do* wear out and fail, and they *do* have
a finite lifetime. When you have a large number of identical disks in
service, it's much easier to see this effect.)

                                                - Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Apr 15 2000 - 21:00:19 EST