Re: Linux 2.6.29

From: Jan Kara
Date: Wed Mar 25 2009 - 08:37:58 EST


On Tue 24-03-09 04:12:49, Andrew Morton wrote:
> On Tue, 24 Mar 2009 11:31:11 +0100 Ingo Molnar <mingo@xxxxxxx> wrote:
> > The thing is ... this is a _bad_ ext3 design bug affecting ext3
> > users in the last decade or so of ext3 existence. Why is this issue
> > not handled with the utmost high priority and why wasnt it fixed 5
> > years ago already? :-)
> >
> > It does not matter whether we have extents or htrees when there are
> > _trivially reproducible_ basic usability problems with ext3.
> >
>
> It's all there in that Oct 2008 thread.
>
> The proposed tweak to kjournald is a bad fix - partly because it will
> elevate the priority of vast amounts of IO whose priority we don't _want_
> elevated.
>
> But mainly because the problem lies elsewhere - in an area of contention
> between the committing and running transactions which we knowingly and
> reluctantly added to fix a bug in
>
> commit 773fc4c63442fbd8237b4805627f6906143204a8
> Author: akpm <akpm>
> AuthorDate: Sun May 19 23:23:01 2002 +0000
> Commit: akpm <akpm>
> CommitDate: Sun May 19 23:23:01 2002 +0000
>
> [PATCH] fix ext3 buffer-stealing
>
> Patch from sct fixes a long-standing (I did it!) and rather complex
> problem with ext3.
>
> The problem is to do with buffers which are continually being dirtied
> by an external agent. I had code in there (for easily-triggerable
> livelock avoidance) which steals the buffer from checkpoint mode and
> reattaches it to the running transaction. This violates ext3 ordering
> requirements - it can permit journal space to be reclaimed before the
> relevant data has really been written out.
>
> Also, we do have to reliably get a lock on the buffer when moving it
> between lists and inspecting its internal state. Otherwise a competing
> read from the underlying block device can trigger an assertion failure,
> and a competing write to the underlying block device can confuse ext3
> journalling state completely.

I've looked at this a bit. I suppose you mean the contention arising from
us taking the buffer lock in do_get_write_access()? But it's not obvious
to me why we'd be contending there... We call this function only for
metadata buffers (unless in data=journal mode), so there isn't a huge
number of these blocks. The buffer should be locked for a longer time only
when we do writeout for a checkpoint (hmm, maybe you meant this one?). In
particular, note that we don't take the buffer lock when committing the
block to the journal - we lock only the BJ_IO buffer. But in that case we
later wait while the buffer is on the BJ_Shadow list, so there is some
contention there.

Also, when I emailed with a few people about these sync problems, they
wrote that switching to data=writeback mode helps considerably, which
would indicate that the handling of ordered-mode data buffers is causing
most of the slowdown...
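
To make the BJ_Shadow case concrete, here is a toy userspace model in C.
It is only a sketch, not kernel code: all toy_* names are made up for
illustration, and the state handling is far simpler than what jbd really
does. It just shows that the original buffer stays unlocked during
commit, yet a handle asking for write access may still have to wait
while the buffer is on the shadow list.

```c
#include <stdbool.h>

/* Hypothetical list identifiers, loosely named after the jbd lists
 * mentioned above (BJ_Shadow, BJ_IO). Not the kernel's definitions. */
enum toy_jlist { TOY_NONE, TOY_SHADOW, TOY_IO };

struct toy_buffer {
    enum toy_jlist list; /* which journal list the buffer is on */
    bool locked;         /* buffer lock held, e.g. during writeout */
};

/* Would a handle asking for write access have to wait on this buffer?
 * Assumption of this model: it waits if the buffer is locked or if it
 * is still on the shadow list of the committing transaction. */
static bool toy_write_access_must_wait(const struct toy_buffer *b)
{
    return b->locked || b->list == TOY_SHADOW;
}

/* Commit starts writing the block to the journal: the original buffer
 * goes to the shadow list unlocked; only the separate IO buffer that
 * carries the journal write is locked. */
static void toy_commit_start(struct toy_buffer *orig, struct toy_buffer *io)
{
    orig->list = TOY_SHADOW;
    io->list = TOY_IO;
    io->locked = true;
}

/* Journal IO finished: the IO buffer is unlocked and the original
 * leaves the shadow list (simplified; the model tracks no other lists). */
static void toy_commit_io_done(struct toy_buffer *orig, struct toy_buffer *io)
{
    io->locked = false;
    io->list = TOY_NONE;
    orig->list = TOY_NONE;
}
```

In this model, calling toy_commit_start() and then
toy_write_access_must_wait() on the original buffer returns true even
though the buffer's own lock was never taken, which is the kind of
contention I mean above.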

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR