Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

From: Wu Fengguang
Date: Fri Jul 31 2009 - 23:01:23 EST


On Fri, Jul 31, 2009 at 04:33:09AM +0800, Martin Bligh wrote:
> (BTW: background ... I'm not picking through this code for fun, I'm
> trying to debug writeback problems introduced in our new kernel
> that are affecting Google production workloads ;-))
>
> >> Well, I see two problems. One is that we set more_io based on
> >> whether s_more_io is empty or not before we finish the loop.
> >> I can't see how this can be correct, especially as there can be
> >> other concurrent writers. So somehow we need to check when
> >> we exit the loop, not during it.
> >
> > It is correct inside the loop, though with some overhead.
> >
> > We put it inside the loop because sometimes the whole filesystem is
> > skipped and we shall not set more_io on them whether or not s_more_io
> > is empty.
>
> My point was that you're setting more_io based on a condition
> at a point in time that isn't when you return to the caller.
>
> By the time you return to the caller (after several more loop
> iterations), that condition may no longer be true.

You are right in that sense. Sorry that my claim of correctness was
somewhat biased: we normally care more about avoiding an early abort,
and don't mind one extra trip over the superblocks. And the extra trip
should be rare enough. I'd be surprised if you observed many of them in
real workloads.
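
To make the case you describe concrete, here is a minimal user-space
sketch. It is not the kernel code: snapshots[] is a hypothetical
stand-in for the number of inodes seen on s_more_io in each loop
iteration, and both function names are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * snapshots[i] is a hypothetical count of inodes on s_more_io as seen
 * during loop iteration i. Concurrent writers may change the list
 * between iterations, so the counts can differ.
 */

/* Flag set inside the loop: once set, it is never cleared again. */
static bool more_io_set_in_loop(const size_t *snapshots, size_t n)
{
	bool more_io = false;

	for (size_t i = 0; i < n; i++)
		if (snapshots[i] != 0)
			more_io = true;
	return more_io;
}

/* Flag decided only from the state seen in the final iteration. */
static bool more_io_checked_at_exit(const size_t *snapshots, size_t n)
{
	return n != 0 && snapshots[n - 1] != 0;
}
```

With snapshots {2, 1, 0} the in-loop variant still reports more_io
although the list was empty by the last iteration. That stale flag is
the one extra trip mentioned above.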

> One other way to address that would be to only set it if we're
> about to fall off the end of the loop, i.e. change it to:
>
> if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
>         wbc->more_io = 1;

Let more_io=0 when there are more inodes in s_io to be worked on?
I cannot understand that, and suspect we are talking about an imaginary
problem on this point ;)
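
A small sketch of why the proposed condition looks odd to me. This is a
simplified model, not the kernel code: struct sb_model and the two
helpers are hypothetical, and only the list lengths are modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the two per-superblock inode lists.
 * Only the element counts matter for this illustration.
 */
struct sb_model {
	size_t nr_s_io;       /* inodes still queued on s_io */
	size_t nr_s_more_io;  /* inodes parked on s_more_io */
};

/* Check as discussed in this thread: s_more_io is non-empty. */
static bool more_io_current(const struct sb_model *sb)
{
	return sb->nr_s_more_io != 0;
}

/* Proposed check: s_more_io is non-empty AND s_io is already drained. */
static bool more_io_proposed(const struct sb_model *sb)
{
	return sb->nr_s_more_io != 0 && sb->nr_s_io == 0;
}
```

With nr_s_io = 3 and nr_s_more_io = 1, more_io_current() returns true
and more_io_proposed() returns false, even though inodes remain on both
lists.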

> >> The other is that we're saying we are setting more_io when
> >> nr_to_write is <=0 ... but we only really check it when
> >> nr_to_write is > 0 ... I can't see how this can be useful?
> >
> > That's the caller's fault - I guess the logic was changed a bit by
> > Jens in linux-next. I noticed this just now. It shall be fixed.
>
> I am guessing you're setting more_io here because we're stopping
> because our slice expired, presumably without us completing
> all the io there was to do? That doesn't seem entirely accurate,
> we could have finished all the pending IO (particularly given that
> we can go over nr_to_write somewhat and send it negative).
> Hence, I thought that checking whether s_more_io and s_io were
> empty at the time of return might be a more accurate check,
> but on the other hand they are shared lists.

Yes, the current more_io logic is not entirely accurate, but I doubt we
can gain much, or that the improvement can be done trivially (not the
lines of code, but the analysis and tests involved).

Anyway if you would take the time to push forward a patch for reducing
the overheads of possible extra trips, I'll take the time to review it ;)

Thanks,
Fengguang