Re: [PATCH 0/3] bdi write bandwidth estimation

From: Wu Fengguang
Date: Mon Jun 13 2011 - 23:45:36 EST


On Tue, Jun 14, 2011 at 06:23:30AM +0800, Andrew Morton wrote:
> On Sun, 12 Jun 2011 23:18:21 +0800
> Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
>
> > Do bdi write bandwidth estimation in the flusher thread at 200ms intervals,
>
> stdrant: anything which is paced using "seconds" is basically always
> wrong. The bandwidth of storage systems varies by who-knows-how-many
> orders of magnitude. If 200ms is correct for one system then it is
> vastly incorrect for another.
>
> A more suitable clock for this estimate would be "per 200 requests",
> for a block-based BDI.
>
> Also of course the bandwidth of a particular BDI varies vastly
> depending on workload. For the purpose of this work, that's probably
> a desirable thing.

It would be good to get more timely estimates for fast devices.
However, we have to balance timeliness against fluctuations.

The main problem is that IO completions may come in bursts. An NFS
commit can be as large as several seconds' worth of data. XFS
completions may be half a second's worth of data if we increase the
write chunk size to half a second's worth of data.

Looking at the other filesystems, e.g. ext4:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext4-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:57/balance_dirty_pages-bandwidth.png

You'll notice fluctuations with a period of around 5 seconds.

Here is another pattern with irregular periods of up to 20 seconds on SSD:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1SSD-64G/ext4-1dd-1M-64p-64288M-20%25-2.6.38-rc6-dt6+-2011-03-01-16-19/balance_dirty_pages-bandwidth.png

That's why I'm not only doing the estimation at 200ms intervals, but
also averaging the samples over a period of 3 seconds, and then going
further with another level of smoothing (the avg_write_bandwidth).
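
To make the two levels concrete, here is a minimal userspace sketch. It
is not the patch code: struct bw_est, bw_est_init(), bw_est_update(),
the millisecond clock and the 1/8 step are all assumptions made for
illustration. Each sample is weighted by its share of the 3 second
window, and the second level moves only when the first-level value
keeps heading in the same direction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_INTERVAL_MS 200   /* minimum gap between two samples */
#define AVG_PERIOD_MS      3000  /* window of the first-level average */

/* Hypothetical estimator state; all names are illustrative. */
struct bw_est {
    uint64_t written_stamp;  /* total bytes written at the last sample */
    uint64_t time_stamp_ms;  /* time of the last sample */
    uint64_t bw;             /* first level: ~3 s weighted average, bytes/s */
    uint64_t avg_bw;         /* second level: spike-filtered, bytes/s */
};

static void bw_est_init(struct bw_est *e, uint64_t initial_bw, uint64_t now_ms)
{
    assert(e != NULL);
    e->written_stamp = 0;
    e->time_stamp_ms = now_ms;
    e->bw = initial_bw;
    e->avg_bw = initial_bw;
}

/* Returns 1 if a sample was taken, 0 if called before the interval elapsed. */
static int bw_est_update(struct bw_est *e, uint64_t written, uint64_t now_ms)
{
    uint64_t elapsed = now_ms - e->time_stamp_ms;
    uint64_t delta = written - e->written_stamp;
    uint64_t old = e->bw;
    uint64_t bw;

    if (elapsed < SAMPLE_INTERVAL_MS)
        return 0;

    if (elapsed >= AVG_PERIOD_MS) {
        /* The sample covers the whole window: use it directly. */
        bw = delta * 1000 / elapsed;
        e->avg_bw = bw;
    } else {
        /* New sample weighs elapsed/period, the old value the rest. */
        bw = (delta * 1000 + old * (AVG_PERIOD_MS - elapsed)) / AVG_PERIOD_MS;

        /*
         * Second level: step avg_bw 1/8 of the way toward the old value,
         * but only if the new value lies on the same side, so that a
         * one-off burst does not move it.
         */
        if (e->avg_bw > old && old >= bw)
            e->avg_bw -= (e->avg_bw - old) / 8;
        else if (e->avg_bw < old && old <= bw)
            e->avg_bw += (old - e->avg_bw) / 8;
    }

    e->bw = bw;
    e->written_stamp = written;
    e->time_stamp_ms = now_ms;
    return 1;
}
```

With a starting estimate of 100MB/s, a burst of 80MB completed within
one 200ms sample (400MB/s instantaneous) raises bw only to 120MB/s in
this sketch, and avg_bw does not move at all.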

Since it's a reasonable optimization for filesystems to do IO
completions in batches, a time-based interval is suitable for
averaging out the bursts while being efficient enough for both fast
and slow storage.


Another important fact: the estimation is carried out every 200ms
only while the flusher thread is _already busy_.

In this way, it won't lead to pointless CPU wakeups at idle time.

The estimated bandwidth reflects how fast the device can write out
when fully utilized, so it won't drop to 0 when the device goes idle.
The value remains constant while the disk is idle. During busy writes,
fluctuations aside, it also remains high unless it is knocked down by
concurrent reads that take some disk time and bandwidth away.
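
One way to picture the idle behaviour is the sketch below. It is an
assumption-laden illustration rather than the patch code: struct
idle_safe_est, est_sample() and the work_start_ms parameter are
hypothetical. The sampler runs only from the writeback loop, and if
the previous stamp predates the current batch of work, it re-snapshots
instead of counting the idle gap as zero bandwidth.

```c
#include <stdint.h>

#define EST_INTERVAL_MS 200   /* minimum gap between two samples */
#define EST_PERIOD_MS   3000  /* averaging window */

/* Hypothetical state; names are illustrative only. */
struct idle_safe_est {
    uint64_t written_stamp;  /* total bytes written at the last stamp */
    uint64_t time_stamp_ms;  /* time of the last stamp */
    uint64_t bw;             /* estimated write bandwidth, bytes/s */
};

/*
 * Meant to be called only from a busy writeback loop, never from a
 * timer, so an idle device causes no wakeups and no updates.
 * work_start_ms is when the current batch of writeback work began.
 */
static void est_sample(struct idle_safe_est *e, uint64_t written,
                       uint64_t now_ms, uint64_t work_start_ms)
{
    uint64_t elapsed = now_ms - e->time_stamp_ms;
    uint64_t delta = written - e->written_stamp;

    if (elapsed < EST_INTERVAL_MS)
        return;

    /*
     * The last stamp was taken before this batch of work started, so
     * the interval includes idle time. Keep the old estimate and only
     * refresh the stamps.
     */
    if (e->time_stamp_ms >= work_start_ms) {
        if (elapsed >= EST_PERIOD_MS)
            e->bw = delta * 1000 / elapsed;
        else
            e->bw = (delta * 1000 + e->bw * (EST_PERIOD_MS - elapsed))
                    / EST_PERIOD_MS;
    }

    e->written_stamp = written;
    e->time_stamp_ms = now_ms;
}
```

In this sketch, an estimate of 100MB/s survives a 60 second idle gap
unchanged, and only drops once a busy interval actually delivers less,
for example when concurrent reads take bandwidth away.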

Thanks,
Fengguang