Re: Linux 2.6.29

From: Eric Sandeen
Date: Mon Mar 30 2009 - 15:13:12 EST


Linus Torvalds wrote:
>
> On Mon, 30 Mar 2009, Ric Wheeler wrote:
>>> But turn that around, and say: if you don't have redundant disks, then
>>> pretty much by definition those drive flushes won't be guaranteeing your
>>> data _anyway_, so why pay the price?
>> They do in fact provide that promise for the extremely common case of power
>> outage and as such, can be used to build reliable storage if you need to.
>
> No they really effectively don't. Not if the end result is "oops, the
> whole track is now unreadable" (regardless of whether it happened due to a
> write during power-out or during some entirely unrelated disk error). Your
> "flush" didn't result in a stable filesystem at all, it just resulted in a
> dead one.
>
> That's my point. Disks simply aren't that reliable. Anything you do with
> flushing and ordering won't make them magically not have errors any more.

But this is apples and oranges, isn't it?

All of the effort that goes into metadata journalling in ext3, ext4,
xfs, reiserfs, jfs ... is to save us from the fsck time on restart, and
to ensure a consistent filesystem framework (metadata, that is, in
general) after an unclean shutdown. That could be due to a system
crash or a power outage, both of which are much more common in my
personal experience than a drive failure.

That journalling requires ordering guarantees, and with large drive
write caches and no ordering, it's not hard for it to go south to the
point where things *do* get corrupted when you lose power or the drive
resets in the middle of basically random write cache destaging. See
Chris Mason's tests from a year or so ago, proving that ext3 is quite
vulnerable to this - it likely explains some of the random htree
corruption that occasionally gets reported to us.
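
To make the ordering point concrete, here is a rough toy model, not
anything taken from ext3/jbd or from Chris's tests. The CachedDisk class,
the block names, and the "destage a random subset on power loss"
behaviour are all assumptions made up for illustration; real drives and
real journals are far more involved. It only shows why a commit record
must not reach the media before the journal blocks it covers.

```python
import random


class CachedDisk:
    """Toy drive with a volatile write cache (hypothetical model)."""

    def __init__(self, seed):
        self.media = {}   # blocks that reached stable storage
        self.cache = []   # pending (block, data) writes, lost on power-off
        self.rng = random.Random(seed)

    def write(self, block, data):
        # Writes are acknowledged as soon as they sit in the cache.
        self.cache.append((block, data))

    def flush(self):
        # Cache flush: everything queued so far reaches the media.
        for block, data in self.cache:
            self.media[block] = data
        self.cache.clear()

    def power_loss(self):
        # Assumption: a random subset of the cache is destaged in
        # random order before the power is gone; the rest is dropped.
        pending = self.cache[:]
        self.rng.shuffle(pending)
        survivors = pending[: self.rng.randint(0, len(pending))]
        for block, data in survivors:
            self.media[block] = data
        self.cache.clear()


def commit_transaction(disk, use_flush):
    disk.write("journal:0", "metadata-update-A")
    disk.write("journal:1", "metadata-update-B")
    if use_flush:
        disk.flush()  # ordering point: journal body is stable first
    disk.write("journal:commit", "COMMIT")


def journal_is_consistent(media):
    # Without a commit record, replay just ignores the transaction.
    if "journal:commit" not in media:
        return True
    # With a commit record, the whole journal body must be present.
    return "journal:0" in media and "journal:1" in media


def count_corrupt(use_flush, trials=1000):
    corrupt = 0
    for seed in range(trials):
        disk = CachedDisk(seed)
        commit_transaction(disk, use_flush)
        disk.power_loss()
        if not journal_is_consistent(disk.media):
            corrupt += 1
    return corrupt


if __name__ == "__main__":
    print("corrupt journals with flush:   ", count_corrupt(True))
    print("corrupt journals without flush:", count_corrupt(False))
```

In this model the flush case never leaves a commit record pointing at
missing journal blocks, while the no-flush case does so in some fraction
of the simulated power cuts. None of this says anything about a block
that later goes unreadable, which is the separate problem.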

And yes, sometimes drives die, and then you are really screwed, but
that's orthogonal to all of the above, I think.

-Eric