Re: Nature of ext4 corruption fixed by recent patch?

From: Theodore Ts'o
Date: Tue May 19 2015 - 09:40:19 EST


On Mon, May 18, 2015 at 03:58:24PM -0700, josh@xxxxxxxxxxxxxxxx wrote:
>
> I recently had my server's filesystem implode, and I'm currently in the
> process of cleaning it up. It had widespread corruption in files and
> directories scattered across the filesystem, though all vaguely recently
> changed. Directories appeared corrupted or truncated, various files
> showed up as piles of NULs, and 5000+ files and directories ended up in
> lost+found. I observed this corruption shortly after a reboot into
> 4.0.2 (from a previous kernel of 3.16), with ext4 noticing an
> inconsistency and mounting the filesystem read-only. The underlying
> disks had no errors.
>
> Reading about the corruption issue fixed by
> d2dc317d564a46dfc683978a2e5a4f91434e9711 ("ext4: fix data corruption
> caused by unwritten and delayed extents"), it sounds plausible. Can
> that strike both file data and directory data, assuming all of that data
> ended up grouped with a delayed extent? Would that bug manifest as
> corrupted directories and files filled with NULs? The system is a
> 72-way server on which I was doing piles of parallel git pulls and
> builds, so hitting a race seems plausible.

Unfortunately, I don't think you can blame all of your problems on the
bug fixed by this particular patch. First of all, it doesn't apply to
directories at all; secondly, it's been around for a long time. I'd
have to check and see whether or not 3.16 had the problem, but it
wouldn't surprise me at all. Finally, git pulls and builds are not
at all likely to hit the problem.

It requires the combination of (a) using buffered I/O to write to a
portion of a file that was not previously allocated, (b) an fallocate
of a region of the file which is a superset of the region written in
(a) before it has a chance to be written to disk, (c) waiting for the file data in
(a) to be written out to disk (either via fsync or via the writeback
daemons), and then (d) before the extent status cache gets pushed out
of memory, another random write to a portion of the file covered by
(a) -- in which case that specific portion of (a) could be replaced by
all zeros.
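
For illustration only, the (a) through (d) sequence above could be
sketched roughly as follows. This is not a reproducer taken from the
patch or from fsx: the block size, offsets, and the function name
replay_pattern are all assumptions, and whether the bug actually
triggers depends on timing and on the state of the extent status
cache. The sketch only shows the order of the system calls.

```python
import os
import tempfile

BLOCK = 4096  # assumed block size, chosen only for illustration


def replay_pattern(path):
    """Issue the (a)-(d) sequence on `path`; return (expected, actual) bytes."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        first = b"A" * (4 * BLOCK)
        second = b"B" * BLOCK

        # (a) buffered write into a region that was not previously allocated
        os.pwrite(fd, first, 0)

        # (b) fallocate a superset of that region before writeback happens
        os.posix_fallocate(fd, 0, 16 * BLOCK)

        # (c) force the data from (a) out to disk
        os.fsync(fd)

        # (d) another write into part of the region covered by (a)
        os.pwrite(fd, second, BLOCK)
        os.fsync(fd)

        expected = first[:BLOCK] + second + first[2 * BLOCK:]
        # Note: this read is normally served from the page cache, so it
        # cannot by itself prove what reached the disk.
        actual = os.pread(fd, len(expected), 0)
        return expected, actual
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        expected, actual = replay_pattern(os.path.join(tmp, "victim"))
        print("intact" if expected == actual else "corrupted")
```

Since the final read usually comes from the page cache, a real check
would have to compare the file contents after the cache has been
dropped or the filesystem remounted.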

Even most database workloads or torrent downloads are not likely to
hit this pattern, since it requires an fallocate of a region of a file
that was previously (and very recently) allocated using a buffered write.
Torrent downloads will tend to fallocate the whole file in advance,
and while Oracle or DB2 might intermix writes and fallocates, they
don't fallocate previously written regions of the file, and they use
direct I/O in any case.

So it's pretty hard to hit this bug by accident, unless you happen to
be using fsx, and even then, the only files that would get corrupted
would be the files being written using fsx. So I'm afraid you'll have
to look farther afield, and consider other bugs as well as potential
hardware problems before trusting the system again.

Cheers,

- Ted

P.S. Bugs like these are why I'm always amused by people who think
that just because a file system is safely being used by its
developers, it's safe to throw production workloads on it. These
sorts of subtle data corruptors tend to be highly timing dependent,
and very hard to find. Sometimes these bugs can hang around for
years before they are found and fixed. The flip side is that
fortunately, they tend to strike very rarely. It's also why I'm very
grateful for developers like Jan and Lukas. :-)