Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()

From: Wu Fengguang
Date: Wed Aug 10 2011 - 00:33:14 EST


On Wed, Aug 10, 2011 at 03:16:22AM +0800, Vivek Goyal wrote:
> On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote:
>
> [..]
> > -/*
> > - * task_dirty_limit - scale down dirty throttling threshold for one task
> > - *
> > - * task specific dirty limit:
> > - *
> > - * dirty -= (dirty/8) * p_{t}
> > - *
> > - * To protect light/slow dirtying tasks from heavier/fast ones, we start
> > - * throttling individual tasks before reaching the bdi dirty limit.
> > - * Relatively low thresholds will be allocated to heavy dirtiers. So when
> > - * dirty pages grow large, heavy dirtiers will be throttled first, which will
> > - * effectively curb the growth of dirty pages. Light dirtiers with high enough
> > - * dirty threshold may never get throttled.
> > - */
>
> Hi Fengguang,
>
> So we have got rid of the notion of per task dirty limit based on their
> fraction? What replaces it.

It's simply removed :)

> I can't see any code which is replacing it.

The think time compensation feature (patch attached) will provide
the same protection for light/slow dirtiers. With it, the slower
dirtiers won't be throttled at all, because the pause time calculated
by

period = pages_dirtied / rate
pause = period - think

will be <= 0.

For example, given write_bw = 100MB/s and

- 2 dd tasks that dirty pages as fast as possible
- 1 scp whose dirty rate is limited by network bandwidth 10MB/s

Then with think time compensation, the real dirty rates will be

- 2 dd tasks: (100-10)/2 = 45MB/s (each)
- 1 scp task: 10MB/s
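
The split above can be reproduced with a small user-space model. This
is only a sketch, not kernel code: split_bandwidth() and its demand[]
array are hypothetical names, and it assumes that a task whose own
dirty rate is below the equal share is never paused (pause <= 0), while
the remaining tasks share what is left equally.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative model only (not kernel code).
 * demand[i] is the rate task i would dirty at if it were never
 * throttled; use a very large value for a dd-style task.
 * rate[i] receives the resulting dirty rate, in the same unit as
 * write_bw (e.g. MB/s).
 */
static void split_bandwidth(double write_bw, const double *demand,
                            double *rate, size_t n)
{
    double remaining = write_bw;
    size_t unsettled = n;
    int changed = 1;

    assert(write_bw >= 0.0);

    for (size_t i = 0; i < n; i++)
        rate[i] = -1.0;                 /* -1 marks "not settled yet" */

    while (changed && unsettled > 0) {
        double share = remaining / (double)unsettled;

        changed = 0;
        for (size_t i = 0; i < n; i++) {
            if (rate[i] < 0.0 && demand[i] <= share) {
                /* light dirtier: think time covers the whole period */
                rate[i] = demand[i];
                remaining -= demand[i];
                unsettled--;
                changed = 1;
            }
        }
    }

    /* Tasks still unsettled are heavy dirtiers: throttle to the share. */
    for (size_t i = 0; i < n; i++)
        if (rate[i] < 0.0)
            rate[i] = remaining / (double)unsettled;
}
```

Calling it with write_bw = 100 and demands of {huge, huge, 10} models
the two dd tasks plus the scp task from the example.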

The scp task won't be throttled by balance_dirty_pages() any more.
This is a tested feature. In the graph below, the dirty rates (the
slopes of the lines) of the last 3 tasks are 2, 4 and 8 MB/s

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/btrfs-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-45/balance_dirty_pages-task-bw.png

given this fio workload, which started one full speed dirtier and
four 1, 2, 4, 8 MB/s rate limited dirtiers

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/btrfs-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-45/fio-rates

> If yes, I am wondering how
> do you get fairness among tasks which share this bdi.
>
> Also wondering what did this patch series to do make sure that tasks
> share bdi more fairly and get write_bw/N bandwidth.

Each of the N dd tasks will be rate limited by

rate = base_rate * pos_ratio

At any given moment, all tasks dirtying the same bdi see almost the
same base_rate and pos_ratio, so they are throttled at almost the same
rate. This is a strong guarantee of fairness in all situations.

Since pos_ratio is fluctuating (evenly) around 1.0, and
base_rate=bdi->dirty_ratelimit is fluctuating around (write_bw/N),
on average we get

avg_rate = (write_bw/N) * 1.0

(I'll explain the "dirty_ratelimit = write_bw/N" magic in other emails.)
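
As a rough illustration of the averaging argument, the fixed-point
multiplication from the patch can be modelled in user space. This is a
sketch under assumptions: CALC_SHIFT stands in for BANDWIDTH_CALC_SHIFT
and its value of 10 is assumed here, and task_rate() and average_rate()
are hypothetical helper names. Rates are in pages per second.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed fixed-point shift; 1.0 is represented as 1 << CALC_SHIFT. */
#define CALC_SHIFT 10

/* rate = base_rate * pos_ratio, with pos_ratio in fixed point */
static unsigned long task_rate(unsigned long base_rate,
                               unsigned long pos_ratio)
{
    return (unsigned long)(((uint64_t)base_rate * pos_ratio) >> CALC_SHIFT);
}

/* Average throttle rate over a series of pos_ratio samples. */
static unsigned long average_rate(unsigned long base_rate,
                                  const unsigned long *pos_ratio, size_t n)
{
    uint64_t sum = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += task_rate(base_rate, pos_ratio[i]);
    return (unsigned long)(sum / n);
}
```

With write_bw = 25600 pages/s and N = 10, base_rate is 2560; feeding in
pos_ratio samples placed symmetrically around 1.0 (for example 0.75,
1.25, 0.875 and 1.125 in fixed point) averages back to 2560, that is,
write_bw/N.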

The below graphs demonstrate the dirty progress of the last 3 dd tasks.
The slope of each curve is the dirty rate.

They vividly show three curves progressing at the same pace in all of
the 3 stages

- rampup stage (20-100s)

- disturbed stage (120s-160s)
(disturbed by starting a 1GB read dd in the middle of the tests)

- stable stage (after 160s)

The three tasks also dirtied almost the same number of pages during the
test.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/8G/xfs-10dd-4k-32p-6802M-20:10-3.0.0-next-20110802+-2011-08-06.16:26/balance_dirty_pages-task-bw.png

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/2G/xfs-10dd-4k-8p-1947M-20:10-3.0.0-next-20110802+-2011-08-06.15:49/balance_dirty_pages-task-bw.png

Thanks,
Fengguang
Subject: writeback: dirty ratelimit - think time compensation
Date: Sat Jun 11 19:25:42 CST 2011

Compensate the task's think time when computing the final pause time,
so that ->dirty_ratelimit can be executed accurately.

In the rare case that the task slept longer than the period time
(resulting in a negative pause time), the extra sleep time will be
compensated in future periods if it's not too big (less than 1s, which
is the -HZ check in the code).

Accumulated errors are carefully avoided as long as the max pause area
is not hit.

Pseudo code:

period = pages_dirtied / bw;
think = jiffies - dirty_paused_when;
pause = period - think;

case 1: period > think

pause = period - think
dirty_paused_when = jiffies + pause

                      period time
      |======================================>|
          think time
      |===============>|
------|----------------|----------------------|-----------
dirty_paused_when      jiffies


case 2: period <= think

don't pause; reduce future pause time by:
dirty_paused_when += period

          period time
      |=========================>|
                      think time
      |======================================>|
------|--------------------------+------------|-----------
dirty_paused_when                             jiffies
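
The two cases, plus the reset for very long think times, can be modelled
in user space roughly as follows. This is a sketch, not the kernel code:
struct dirtier and think_time_pause() are hypothetical names, and the
HZ and MAX_PAUSE values are assumptions chosen for the model.

```c
#include <assert.h>

#define HZ        1000          /* assumed tick rate for this model */
#define MAX_PAUSE (HZ / 5)      /* assumed upper bound on one pause */

/* Hypothetical stand-in for the per-task fields used by the patch. */
struct dirtier {
    unsigned long dirty_paused_when;  /* start of a write-and-pause period */
    int nr_dirtied;
};

/*
 * Returns the number of jiffies to sleep (0 means "don't pause") and
 * updates the task state for the next period.
 */
static long think_time_pause(struct dirtier *t, unsigned long now,
                             long period)
{
    long pause;

    assert(period >= 0);
    pause = (long)(t->dirty_paused_when + period - now);

    if (pause <= 0) {
        if (pause < -HZ) {
            /* overslept by more than 1s: probably a light dirtier */
            t->dirty_paused_when = now;
            t->nr_dirtied = 0;
        } else if (period) {
            /* case 2: carry the extra think time into future periods */
            t->dirty_paused_when += period;
            t->nr_dirtied = 0;
        }
        return 0;
    }

    /* case 1: sleep for whatever is left of the period */
    if (pause > MAX_PAUSE)
        pause = MAX_PAUSE;
    t->dirty_paused_when = now + pause;
    t->nr_dirtied = 0;
    return pause;
}
```

For a task with dirty_paused_when = 1000 and period = 100, calling this
at now = 1040 gives a pause of 60 jiffies, while calling it at
now = 1150 gives no pause and advances dirty_paused_when to 1100.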

Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
---
 include/linux/sched.h |    1 +
 kernel/fork.c         |    1 +
 mm/page-writeback.c   |   34 +++++++++++++++++++++++++++++++---
 3 files changed, 33 insertions(+), 3 deletions(-)

--- linux-next.orig/include/linux/sched.h 2011-08-09 07:53:31.000000000 +0800
+++ linux-next/include/linux/sched.h 2011-08-09 07:54:12.000000000 +0800
@@ -1531,6 +1531,7 @@ struct task_struct {
 	 */
 	int nr_dirtied;
 	int nr_dirtied_pause;
+	unsigned long dirty_paused_when;	/* start of a write-and-pause period */

 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
--- linux-next.orig/mm/page-writeback.c 2011-08-09 07:53:31.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-08-09 08:08:11.000000000 +0800
@@ -817,6 +817,7 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	long period;
 	long pause = 0;
 	bool dirty_exceeded = false;
 	unsigned long bw;
@@ -825,6 +826,8 @@ static void balance_dirty_pages(struct a
 	unsigned long start_time = jiffies;

 	for (;;) {
+		unsigned long now = jiffies;
+
 		/*
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been
@@ -842,8 +845,11 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= (background_thresh + dirty_thresh) / 2) {
+			current->dirty_paused_when = now;
+			current->nr_dirtied = 0;
 			break;
+		}

 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);

@@ -879,17 +885,40 @@ static void balance_dirty_pages(struct a
 		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
 					bdi_thresh, bdi_dirty);
 		if (unlikely(bw == 0)) {
+			period = MAX_PAUSE;
 			pause = MAX_PAUSE;
 			goto pause;
 		}
 		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
-		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
+		period = (HZ * pages_dirtied + bw / 2) / (bw | 1);
+		pause = current->dirty_paused_when + period - now;
+		/*
+		 * For less than 1s think time (ext3/4 may block the dirtier
+		 * for up to 800ms from time to time on 1-HDD; so does xfs,
+		 * however at much less frequency), try to compensate it in
+		 * future periods by updating the virtual time; otherwise just
+		 * do a reset, as it may be a light dirtier.
+		 */
+		if (unlikely(pause <= 0)) {
+			if (pause < -HZ) {
+				current->dirty_paused_when = now;
+				current->nr_dirtied = 0;
+			} else if (period) {
+				current->dirty_paused_when += period;
+				current->nr_dirtied = 0;
+			}
+			pause = 1; /* avoid resetting nr_dirtied_pause below */
+			break;
+		}
 		pause = min(pause, MAX_PAUSE);

 pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);

+		current->dirty_paused_when = now + pause;
+		current->nr_dirtied = 0;
+
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
 		 * max-pause area. If dirty exceeded but still within this
@@ -916,7 +945,6 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;

-	current->nr_dirtied = 0;
 	current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh);

 	if (writeback_in_progress(bdi))
--- linux-next.orig/kernel/fork.c 2011-08-09 07:53:31.000000000 +0800
+++ linux-next/kernel/fork.c 2011-08-09 07:54:12.000000000 +0800
@@ -1303,6 +1303,7 @@ static struct task_struct *copy_process(

 	p->nr_dirtied = 0;
 	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
+	p->dirty_paused_when = 0;

 	/*
 	 * Ok, make it visible to the rest of the system.