Re: 2.6.19 file content corruption on ext3

From: Linus Torvalds
Date: Tue Dec 19 2006 - 14:09:33 EST




On Tue, 19 Dec 2006, Linus Torvalds wrote:
>
> here's a totally new tangent on this: it's possible that user code is
> simply BUGGY.

Btw, here's a simpler test-program that actually shows the difference
between 2.6.18 and 2.6.19 in action, and why it could explain why a
program like rtorrent might show corruption behavious that it didn't show
before.

#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <string.h>

int main(int argc, char **argv)
{
char *mapping;
int fd;

fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
if (fd < 0)
return -1;
if (ftruncate(fd, 10) < 0)
return -1;
mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (-1 == (int)(long)mapping)
return -1;
memset(mapping, 0xaa, 20);
sync();
if (ftruncate(fd, 40) < 0)
return -1;
memset(mapping + 20, 0x55, 20);
write(1, mapping, 40);
return 0;
}

Notice the "sync()" in between the "memset()" and the "ftruncate()". In
2.6.18, that would normally do absolutely _nothing_ to the shared memory
mapping, becuase we simply couldn't track pages that were dirty in the
page tables.

So in 2.6.18, if you try this, with

./a.out | od -x

you should see something like

0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050

which matches your memset() patterns: 20 bytes of 0xaa, and 20 bytes of
0x55.

HOWEVER.

In 2.6.19, because we actually track dirty data so much better, "sync()"
will actually be smart enough to write out the dirty mmap'ed data too. But
since the user program has only allocated ten bytes for it in the file,
when it is written out, the rest of the page is cleared. When you then
write the last 20 bytes (after _properly_ allocating memory for them), you
should now see a pattern like

0000000 aaaa aaaa aaaa aaaa aaaa 0000 0000 0000
0000020 0000 0000 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050

instead: with ten bytes of zero in between, because the data that couldn't
be written out was cleared.

So 2.6.19 is strictly _better_, but exactly because it's tracking dirty
status much more precisely, you'll see certain user-level bugs much more
easily.

NOTE NOTE NOTE! The code really _was_ buggy in 2.6.18 too, and you _can_
get the zeroes in the middle of the file with an older kernel. But in
older kernels, you need to be really really unlucky, and have the page
cleaned by strong memory pressure. In 2.6.19, any "sync()" activity
(includign from the outside) will clean the page, so a user program with
this bug can just be made to trigger the bug much more easily.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/