Re: critical bugs in md raid5

From: Marc Lehmann
Date: Thu Jan 27 2005 - 01:32:23 EST


On Thu, Jan 27, 2005 at 06:11:34AM +0100, Andi Kleen <ak@xxxxxx> wrote:
> Marc Lehmann <linux-kernel@xxxxxxxx> writes:
> > The summary seems to be that the linux raid driver only protects your data
> > as long as all disks are fine and the machine never crashes.
>
> "as long as the machine never crashes". That's correct. If you think
> about how RAID 5 works there is no way around it. When a write to

I disagree. When not working in degraded mode, it's absolutely reasonable
to e.g. use only the non-parity data. A crash with raid5 is in no way
different to a crash without raid5 then: either the old data is on the
disk, the new data is on the disk, or you had some catastrophic disk event
and no data is on the disk.

The case I reported was not a catastrophic failure: either the old or new
data was on the disk, and the filesystem journaling (which is ext3) will
take care of it. Even if the parity information is not in sync, either old or
new data is on the disk.
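
To make the argument concrete, here is a minimal sketch in Python of a crash between the data write and the parity write. The 3-disk stripe, the block contents and the helper name xor_blocks are made up for illustration and have nothing to do with the actual md code. It only shows that, with all disks present, the data blocks can be read without consulting parity, and that parity can be recomputed from the data afterwards.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 3-disk stripe: two data blocks plus one parity block.
data = [b"old0", b"old1"]
parity = xor_blocks(data)

# Crash mid-write: the new data block reached its disk,
# but the matching parity update did not.
data[0] = b"new0"

# Non-degraded read: served straight from the data disks,
# so the stale parity is never consulted.
read_back = list(data)

# The stripe stays inconsistent until a resync recomputes parity from the data.
parity_was_stale = parity != xor_blocks(data)
parity = xor_blocks(data)
```

In this sketch the read returns the new block and the untouched old block, which is the "old or new data" situation a journaling filesystem is meant to cope with.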

> a single stripe is interrupted (machine crash) and you lose a disk
> during the recovery a lot of data (even unrelated to the data just written)
> is lost.

This is not what I described. In fact, I haven't lost any data, despite
having had a number of such problems (I did verify that afterwards, and
found no differences. Maybe this is luck, but it seems to happen in the
majority of cases, and I had a similar problem at least 5 or 6 times
because I didn't encounter the bug I reported).

> But that's nothing inherent in Linux RAID5. It's a generic problem.
> Pretty much all Software RAID5 implementations have it.

Indeed, but I think linux' behaviour is especially poor. For example, the
renumbering of the devices or the strange rebuild-restart behaviour (which
is definitely a bug) will make recovery unnecessarily complicated.

> RAID-1 helps a bit, because you either get the old or the new data,
> but not some corruption.

You don't get any magical corruption with RAID5 either... the data contents
will either be old, or new. The difference is that you cannot trust parity.
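
What "cannot trust parity" means can be sketched in a few lines of Python. This is an illustration under assumed block contents and a hypothetical 3-disk stripe (two data blocks d0 and d1 plus parity), not a description of how md handles it. While all disks are present, d0 and d1 are read directly and are fine. Only a rebuild that relies on parity left stale by an interrupted write produces wrong contents, and it does so for a block that was never written.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical stripe contents before the crash.
d0_old, d1 = b"AAAA", b"BBBB"
p_stale = xor_bytes(d0_old, d1)    # parity matching the old contents

# The write of d0 reached the disk; the parity update was lost in the crash.
d0_new = b"CCCC"

# If the disk holding d1 fails before a resync, d1 has to be rebuilt
# from d0 and the stale parity, which gives the wrong result.
rebuilt_from_stale = xor_bytes(d0_new, p_stale)

# With parity resynced from the data first, the same rebuild is correct.
p_fresh = xor_bytes(d0_new, d1)
rebuilt_from_fresh = xor_bytes(d0_new, p_fresh)
```

Here rebuilt_from_stale differs from d1 while rebuilt_from_fresh equals it, so the damage depends entirely on whether parity is used before it has been resynced.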

> In practice even old data can be a big
> problem though (e.g. when file system metadata is affected)

Of course, but that's supposed to be worked around by using a journaling
file system, right?

> Moral: if you really care about your data, back up very often and
> use RAID-1 or get an expensive hardware RAID with battery backup
> (all the cheap "hardware RAIDs" are equally useless for this)

Yes, I have been thinking about that for some time now, but always had a
problem because the affordable ones have low performance. But given linux'
effective slower-than-a-single-disk performance it shouldn't be hard to
beat nowadays.

There is, however, at least the resyncing with only 4 out of 5 disks, which
is doubtless a bug somewhere.

--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ pcg@xxxxxxxx
--==---/ / _ \/ // /\ \/ / http://schmorp.de/
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE