Re: regression in page writeback

From: Wu Fengguang
Date: Thu Oct 01 2009 - 11:14:50 EST


Ted,

On Wed, Sep 30, 2009 at 10:11:58PM +0800, Theodore Ts'o wrote:
> On Wed, Sep 30, 2009 at 01:26:57PM +0800, Wu Fengguang wrote:
> > It's good to increase MAX_WRITEBACK_PAGES; however, I'm afraid
> > max_contig_writeback_mb may be a burden in the future: either it is
> > not necessary, or a per-bdi counterpart must be introduced for all
> > filesystems.
>
> The per-filesystem tunable was just a short-term hack; the reason I
> did it that way was that it was clear a global tunable wouldn't fly,
> and rightly so --- what might be suitable for a slow USB stick might
> be very different from what suits a super-fast RAID array, and someone
> might very well have both on the same system.

Ah, yes.

> > And it's preferable to handle slow devices automatically with the
> > increased chunk size, instead of adding another parameter.
>
> Agreed; long-term what we probably need is something which is
> automatically tunable. My thinking was that we should tune the
> initial nr_to_write parameter based on how many blocks could be
> written in some time interval, which is tunable. So if we decide that
> 1 second is a suitable time period to be writing out one inode's dirty
> pages, then for a fast server-class SATA disk, we might want to set
> nr_to_write to be around 128MB worth of pages. For a laptop SATA
> disk, it might be around 64MB, and for a really slow USB stick, it
> might be more like 16MB. For a super-fast enterprise RAID array, 128MB
> might be too small!

Yes, 128MB may be too small :)
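To make that concrete, here is a minimal sketch of bandwidth-based
sizing. The write_bandwidth_kbps field and the helper name are made up
for illustration; no such member exists in struct backing_dev_info:

#include <linux/backing-dev.h>
#include <linux/kernel.h>

#define WRITEBACK_WINDOW_SECS	1	/* time budget per inode */

/*
 * Hypothetical sketch: size nr_to_write so that one inode gets
 * roughly WRITEBACK_WINDOW_SECS worth of IO on this device.
 */
static unsigned long bdi_nr_to_write(struct backing_dev_info *bdi)
{
	/* pages the device can complete within the time budget */
	unsigned long pages = bdi->write_bandwidth_kbps *
			      WRITEBACK_WINDOW_SECS * 1024 / PAGE_SIZE;

	/* clamp between 4MB (slow USB sticks) and 256MB (fast arrays) */
	return clamp(pages, 4UL << (20 - PAGE_SHIFT),
			    256UL << (20 - PAGE_SHIFT));
}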

> If we get timing and/or congestion information from the block layer,
> it wouldn't be hard to figure out the optimal number of pages that
> should be sent down to the filesystem, and to tune this automatically.

Sure, it's possible.
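For example, the bandwidth estimate could be refreshed from observed
completion timing with something like the below (purely illustrative;
write_bandwidth_kbps is the same hypothetical field as above):

#include <linux/jiffies.h>

/*
 * Hypothetical feedback loop: update the per-bdi bandwidth estimate
 * from how long the last batch of pages took to complete.
 */
static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
				       unsigned long pages_done,
				       unsigned long elapsed /* jiffies */)
{
	unsigned long kbps;

	if (!elapsed)
		return;

	/* pages per jiffy -> KB/s */
	kbps = pages_done * (PAGE_SIZE / 1024) * HZ / elapsed;

	/* exponential moving average to damp out bursts */
	bdi->write_bandwidth_kbps =
		(bdi->write_bandwidth_kbps * 7 + kbps) / 8;
}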

> > I scratched together a patch to demo the ideas collected in recent
> > discussions. Can you check if it serves your needs? Thanks.
>
> Sure, I'll definitely play with it, thanks.

Thanks :)

> > The wbc.timeout (when used per-file) is mainly a safeguard against slow
> > devices, which may take too long to sync 128MB of data.
>
> Maybe I'm missing something, but I don't think the wbc.timeout
> approach is sufficient. Consider the scenario of someone who is
> ripping a DVD disc to an 8 gig USB stick. The USB stick will be very
> slow, but since the file is contiguous the filesystem will very
> happily try to push it out there 128MB at a time, and the wbc.timeout
> value isn't really going to help since a single call to writepages
> could easily cause 128MB worth of data to be streamed out to the USB
> stick.

Yes and no. Yes if the queue was empty for the slow device. No if the
queue was full, in which case IO submission speed = IO completion speed
for previously queued requests.

So wbc.timeout will be accurate for IO submission time, and mostly
accurate for IO completion time. The transient queue fill-up phase
should not be a big problem?
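In code terms, the safeguard looks roughly like the below.
writeback_one_chunk() is a hypothetical helper; only the timeout field
comes from the demo patch:

#include <linux/fs.h>
#include <linux/jiffies.h>
#include <linux/writeback.h>

/*
 * Sketch of the wbc.timeout safeguard.  Submission time is bounded
 * directly; completion time only indirectly, via the full-queue
 * behavior described above.
 */
static void writeback_inode_bounded(struct inode *inode,
				    struct writeback_control *wbc)
{
	unsigned long deadline = jiffies + wbc->timeout;

	while (wbc->nr_to_write > 0) {
		writeback_one_chunk(inode, wbc);	/* hypothetical */

		/* stop once the per-file time budget is spent */
		if (wbc->timeout && time_after(jiffies, deadline))
			break;
	}
}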

> This is why the MAX_WRITEBACK_PAGES really needs to be tuned on a
> per-bdi basis; either manually, via a sysfs tunable, or automatically,
> by auto-tuning based on how fast the storage device is or by some kind
> of congestion-based approach. This is certainly the best long-term
> solution; my concern was that it might take a long time for us to get
> the auto-tunable just right, so in the meantime I added a
> per-mounted-filesystem tunable and put the hack in the filesystem
> layer. I would like nothing better than to rip it out, once we have a
> long-term solution.
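Agreed. For the manual variant, a per-bdi knob could follow the
existing read_ahead_kb attribute in mm/backing-dev.c; the
max_writeback_mb field below is made up for illustration:

/*
 * Sketch of a manual per-bdi tunable, modeled on the read_ahead_kb
 * sysfs attribute.  max_writeback_mb is not an existing member of
 * struct backing_dev_info.
 */
static ssize_t max_writeback_mb_store(struct device *dev,
				      struct device_attribute *attr,
				      const char *buf, size_t count)
{
	struct backing_dev_info *bdi = dev_get_drvdata(dev);
	unsigned long mb;
	int ret;

	ret = strict_strtoul(buf, 10, &mb);
	if (ret)
		return ret;

	bdi->max_writeback_mb = mb;	/* hypothetical field */
	return count;
}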

Thanks,
Fengguang