Re: 2.6.19 file content corruption on ext3

From: Anton Altaparmakov
Date: Thu Dec 28 2006 - 18:59:49 EST

Next message: Rusty Russell: "Re: [PATCH] Use correct macros in raid code, not raw asm"
Previous message: Andrew Morton: "Re: [PATCH] Use correct macros in raid code, not raw asm"
In reply to: Andrew Morton: "Re: 2.6.19 file content corruption on ext3"
Next in thread: Linus Torvalds: "Re: 2.6.19 file content corruption on ext3"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, 28 Dec 2006, Linus Torvalds wrote:
> Ok,
> with the ugly trace capture patch, I've actually captured this corruption
> in action, I think.
>
> I did a full trace of all pages involved in one run, and picked one
> corruption at random:
>
> Chunk 14465 corrupted (0-75) (01423fb4-01423fff)
> Expected 129, got 0
> Written as (5126)9509(15017)
>
> That's the first 76 bytes of a chunk missing, and it's the last 76 bytes
> on a page. It's page index 01423 in the mapped file, and bytes fb4-fff
> within that file.
>
> There were four chunks written to that page:
>
> Writing chunk 14463/15800 (15%) (0142344c) (1)
> Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 00001423)
> Writing chunk 14464/15800 (32%) (01423a00) (3)
> Writing chunk 14465/15800 (60%) (01423fb4) (4) <--- LOST!
>
> and the other three chunks checked out all right.
>
> And here's the annotated trace as it concerns that page:
>
> - here we write the first chunk to the page:
> ** (1) do_no_page: mapping index 00001423 at b7d1f44c (write)
> ** Setting page 00001423 dirty
>
> - something flushes it out to disk:
> ** cpd_for_io: index 00001423
> ** cleaning index 00001423 at b7d1f000
>
> - here we write the second chunk (which was split over the previous page
> and the interesting one):
> ** (2) Setting page 00001422 dirty
> ** (2) Setting page 00001423 dirty
>
> - and here we do a cleaning event
> ** cpd_for_io: index 00001423
> ** cleaning index 00001423 at b7d1f000
>
> - here we write the third chunk:
> ** (3) Setting page 00001423 dirty
>
> - here we write the fourth chunk:
> ** (4) NO DIRTY EVENT
>
> - and a third flush to disk:
> ** cpd_for_io: index 00001423
> ** cleaning index 00001423 at b7d1f000
>
> - here we unmap and flush:
> ** Unmapped index 00001423 at b7d1f000
> ** Removing index 00001423 from page cache
>
> - here we remap to check:
> ** do_no_page: mapping index 00001423 at b7d1f000 (read)
> ** Unmapped index 00001423 at b7d1f000
>
> - and finally, here I remove the file after the run:
> ** Removing index 00001423 from page cache
>
> Now, the important thing to see here is:
>
> - the missing write did not have a "Setting page 00001423 dirty" event
> associated with it.
>
> - but I can _see_ where the actual dirty event would be happening in the
> logs, because I can see the dirty events of the other chunk writes
> around it, so I know exactly where that fourth write happens. And
> indeed, it _shouldn't_ get a dirty event, because the page is still
> dirty from the write of chunk #3 to that page, which _did_ get a dirty
> event.
>
> I can see that, because the testing app writes the log of the pages it
> writes, and this is the log around the fourth and final write:
>
> ...
> Writing chunk 5338/15800 (60%) (0076eb48) PFN: 76e/76f
> Writing chunk 960/15800 (60%) (00156300) PFN: 156
> Writing chunk 14465/15800 (60%) (01423fb4) <----
> Writing chunk 8594/15800 (60%) (00bf74a8) PFN: bf7
> Writing chunk 556/15800 (60%) (000c62f0) PFN: c6
> Writing chunk 15190/15800 (60%) (01526678) PFN: 1526
> ...
>
> and I can match this up with the full log from the kernel, which looks
> like this:
>
> Setting page 0000076e dirty
> Setting page 0000076f dirty
> Setting page 00000156 dirty
> Setting page 000000c6 dirty
> Setting page 00001526 dirty
>
> so I know exactly where the missing writes (to our page at pfn 1423,
> and the fpn-bf7 page) happened.
>
> - and the thing is, I can see a "cpd_for_io()" happening AFTER that
> fourth write. Quite a long while after, in fact. So all of this looks
> very fine indeed. We are not losing any dirty bits.
>
> - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses
> the SAME dirty bit as write 4 did (which didn't make it out to disk!).
> The event that clears the dirty bit that write 3 did happens AFTER
> write 4 has happened!
>
> So if we're not losing any dirty bits, what's going on?
>
> I think we have some nasty interaction with the buffer heads. In

But are chunks 3 and 4 in separate buffer heads? Sorry could not see it
immediately from the output you showed...

It is just that there may be a different cause rather than buffer dirty
state...

A shot in the dark I know but it could perhaps be that a "COW for
MAP_PRIVATE" like event happens when the page is dirty already thus the
second write never actually makes it to the shared page thus it never gets
written out.

I am almost certainly totally barking up the wrong tree but I thought it
may be worth mentioning just in case there was a slip in the COW logic or
page writable state maintenance somewhere...

Best regards,

Anton

> particular, I don't think it's the dirty page bits that are broken (I
> _see_ that the PageDirty bit was set after write 4 was done to memory in
> the kernel traces). So I think that a real writeback just doesn't happen,
> because somebody has marked the buffer heads clean _after_ it started IO
> on them.
>
> I think "__mpage_writepage()" is buggy in this regard, for example. It
> even has a comment about its crapola behaviour:
>
> /*
> * Must try to add the page before marking the buffer clean or
> * the confused fail path above (OOM) will be very confused when
> * it finds all bh marked clean (i.e. it will not write anything)
> */
>
> however, I don't think that particular thing explains it, because I don't
> think we use that function for the cases I'm looking at.
>
> Anyway, I'll add tracing for page-writeback setting/cleaning too, in case
> I might see anything new there..
>
> Linus

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Rusty Russell: "Re: [PATCH] Use correct macros in raid code, not raw asm"
Previous message: Andrew Morton: "Re: [PATCH] Use correct macros in raid code, not raw asm"
In reply to: Andrew Morton: "Re: 2.6.19 file content corruption on ext3"
Next in thread: Linus Torvalds: "Re: 2.6.19 file content corruption on ext3"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]