Re: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case.

From: Ric Wheeler
Date: Thu Sep 03 2009 - 08:06:04 EST


On 09/02/2009 06:49 PM, Rob Landley wrote:
From: Rob Landley<rob@xxxxxxxxxxx>

Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
explaining that using a journaling filesystem can't overcome this problem.

Signed-off-by: Rob Landley<rob@xxxxxxxxxxx>
---

Documentation/md.txt | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/Documentation/md.txt b/Documentation/md.txt
index 4edd39e..52b8450 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use

md-mod.start_dirty_degraded=1

+Note that journaling filesystems do not effectively protect data in this
+case, because the update granularity of the RAID is larger than the journal
+was designed to expect. Reconstructing data via parity information involves
+matching together corresponding stripes, and updating only some of these
+stripes renders the corresponding data in all the unmatched stripes
+meaningless. Thus seemingly unrelated data in other parts of the filesystem
+(stored in the unmatched stripes) can become unreadable after a partial
+update, but the journal is only aware of the parts it modified, not the
+"collateral damage" elsewhere in the filesystem which was affected by those
+changes.
+
+Thus successful journal replay proves nothing in this context, and even a
+full fsck only shows whether or not the filesystem's metadata was affected.
+(A proper solution to this problem would involve adding journaling to the RAID
+itself, at least during degraded writes. In the meantime, try not to allow
+a system to shut down uncleanly with its RAID both dirty and degraded; it
+can handle one but not both.)

Superblock formats
------------------



NACK.

Now you have moved the inaccurate documentation about journalling file systems into the MD documentation.

Repeat after me:

(1) partial writes to a RAID stripe (with or without file systems, with or without journals) create an invalid stripe

(2) partial writes can be prevented in most cases by running with the write cache disabled or with working barriers

(3) fsck (for a journalling or non-journalling fs) can detect and repair your file system. It won't give you back the data in that stripe, but you will get the rest of your metadata and data back in usable form.
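
Point (1) can be illustrated with a toy model of one stripe. This is only a sketch under stated assumptions: it uses plain XOR parity over three 4-byte data chunks, and the names (xor_blocks, d0, d1, d2) are made up for the illustration, not taken from the MD code. It shows that once a data chunk is rewritten without its matching parity update, rebuilding the chunk on the missing disk no longer returns what was stored there.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data chunks plus parity (toy 4-byte chunks, hypothetical).
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Degraded array: the disk holding d2 is gone, so d2 is rebuilt from the rest.
assert xor_blocks(d0, d1, parity) == d2

# Partial stripe write: d0 is rewritten, but power is lost before the
# matching parity update reaches disk.
new_d0 = b"ZZZZ"
rebuilt_d2 = xor_blocks(new_d0, d1, parity)  # parity is now stale

# False: the stripe is invalid, and d2 (which nobody wrote to) is lost.
stripe_is_valid = rebuilt_d2 == d2
```

Nothing in the model involves a file system or a journal; the damage is confined to the one stripe whose data and parity disagree.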

You don't need MD in the picture to test this: take fsfuzzer, or just dd, and zero out a RAID stripe width of data from a file system. If you hit data blocks, your fsck (for ext2) or mount (for any journalling fs) will not see an error. If you hit metadata, fsck, when run, will in both cases try to fix it as best it can.
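
For anyone who would rather script the dd step, here is a rough sketch. The 64 KiB chunk size and the 3-disk layout (two data chunks per stripe) are assumptions chosen for the example, and zero_stripe is a hypothetical helper, not an existing tool. The demo runs against a throwaway temporary file standing in for a file system image; point it at a scratch copy of a real image, never at a live device, and run fsck or mount on the result yourself.

```python
import os
import tempfile

CHUNK_SIZE = 64 * 1024   # assumed md chunk size
DATA_DISKS = 2           # assumed: 3-disk RAID5, so 2 data chunks per stripe
STRIPE_WIDTH = CHUNK_SIZE * DATA_DISKS

def zero_stripe(image_path: str, stripe_index: int) -> int:
    """Overwrite one stripe-width region with zeros; return its byte offset."""
    offset = stripe_index * STRIPE_WIDTH
    with open(image_path, "r+b") as img:
        img.seek(offset)
        img.write(b"\0" * STRIPE_WIDTH)
        img.flush()
        os.fsync(img.fileno())
    return offset

# Demo on a throwaway file standing in for a file system image.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xff" * (4 * STRIPE_WIDTH))
    scratch = tmp.name

damaged_at = zero_stripe(scratch, stripe_index=1)
with open(scratch, "rb") as img:
    contents = img.read()
os.unlink(scratch)
```

Whether the zeroed region lands on data or on metadata depends on the file system layout, so it is worth trying several stripe indexes.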

Also note that partial writes (similar to torn writes) can happen for multiple reasons on non-RAID systems and leave the same kind of damage.

Side note: proposing a half-sketched-out "fix" for partial stripe writes in documentation is not productive. It is much better to submit a fully thought-out proposal or actual patches to demonstrate the issue.

Rob, you should really try to take a few disks, build a working MD RAID5 group and test your ideas. Try it with and without the write cache enabled.

Measure and report, say after 20 power losses, how file integrity and fsck repairs were impacted.

Try the same with ext2 and ext3.

Regards,

Ric
