Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback

From: Wu Fengguang
Date: Tue Apr 19 2011 - 08:56:25 EST


On Tue, Apr 19, 2011 at 05:57:40PM +0800, Jan Kara wrote:
> On Tue 19-04-11 17:35:23, Dave Chinner wrote:
> > On Tue, Apr 19, 2011 at 11:00:06AM +0800, Wu Fengguang wrote:
> > > A background flush work may run forever. So it's reasonable for it to
> > > mimic the kupdate behavior of syncing old/expired inodes first.
> > >
> > > The policy is
> > > - enqueue all newly expired inodes at each queue_io() time
> > > - enqueue all dirty inodes if there are no more expired inodes to sync
> > >
> > > This will help reduce the number of dirty pages encountered by page
> > > reclaim, e.g. the pageout() calls. Normally older inodes contain older
> > > dirty pages, which are closer to the end of the LRU lists. So
> > > syncing older inodes first helps reduce the dirty pages reached by
> > > the page reclaim code.
> >
> > Once again I think this is the wrong place to be changing writeback
> > policy decisions. for_background writeback only goes through
> > wb_writeback() and writeback_inodes_wb() (same as for_kupdate
> > writeback), so a decision to change from expired inodes to fresh
> > inodes, IMO, should be made in wb_writeback.
> >
> > That is, for_background and for_kupdate writeback start with the
> > same policy (older_than_this set) to writeback expired inodes first,
> > then when background writeback runs out of expired inodes, it should
> > switch to all remaining inodes by clearing older_than_this instead
> > of refreshing it for the next loop.
> Yes, I agree with this and my impression is that Fengguang is trying to
> achieve exactly this behavior.
>
> > This keeps all the policy decisions in the one place, all using the
> > same (existing) mechanism, and all relatively simple to understand,
> > and easy to tracepoint for debugging. Changing writeback policy
> > deep in the writeback stack is not a good idea as it will make
> > extending writeback policies in future (e.g. for cgroup awareness)
> > very messy.
> Hmm, I see. I agree the policy decisions should be in one place if
> reasonably possible. Fengguang moves them from wb_writeback() to inode
> queueing code which looks like a logical place to me as well - there we
> have the largest control over which inodes we decide to write and don't
> have to pass all the detailed 'instructions' down in wbc structure. So if
> we later want to add cgroup awareness to writeback, I imagine we just add
> the knowledge to inode queueing code.

I actually started with wb_writeback() as the natural choice, and then
found it much easier to do the expired-only => all-inodes switch in
move_expired_inodes(), since the switch is triggered by whether the
@b_dirty and @tmp lists are empty. It's not sane for wb_writeback() to
look into such details. And once the switch is done in
move_expired_inodes(), the whole policy naturally follows.
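
To make the switch concrete, here is a rough user-space model in Python.
It is only a sketch, not the kernel code: the Inode class, the integer
timestamps and the oldest-first deque are assumptions made for the
illustration. It shows why the decision needs both the emptiness of the
temporary list and of b_dirty.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Inode:
    name: str
    dirtied_when: int  # model timestamp of when the inode was dirtied


def move_expired_inodes(b_dirty, older_than_this):
    """Model of the expired-only => all-inodes switch.

    b_dirty is a deque ordered oldest-first (an assumption of this model).
    Returns the inodes to queue for I/O and removes them from b_dirty.
    """
    tmp = []
    # First pass: take only inodes dirtied at or before older_than_this.
    while b_dirty and b_dirty[0].dirtied_when <= older_than_this:
        tmp.append(b_dirty.popleft())
    # The switch: nothing expired (tmp empty) but dirty inodes remain,
    # so fall back to queueing every dirty inode.
    if not tmp and b_dirty:
        tmp.extend(b_dirty)
        b_dirty.clear()
    return tmp
```

With inodes dirtied at times 1, 5 and 9 and older_than_this = 6, the
first call returns the two expired inodes; the second call finds nothing
expired and falls back to the remaining one.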

> > > @@ -585,7 +597,8 @@ void writeback_inodes_wb(struct bdi_writ
> > >          if (!wbc->wb_start)
> > >                  wbc->wb_start = jiffies; /* livelock avoidance */
> > >          spin_lock(&inode_wb_list_lock);
> > > -        if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > +
> > > +        if (list_empty(&wb->b_io))
> > >                  queue_io(wb, wbc);
> > >
> > >          while (!list_empty(&wb->b_io)) {
> > > @@ -612,7 +625,7 @@ static void __writeback_inodes_sb(struct
> > >          WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > >
> > >          spin_lock(&inode_wb_list_lock);
> > > -        if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > +        if (list_empty(&wb->b_io))
> > >                  queue_io(wb, wbc);
> > >          writeback_sb_inodes(sb, wb, wbc, true);
> > >          spin_unlock(&inode_wb_list_lock);
> >
> > That changes the order in which we queue inodes for writeback.
> > Instead of calling every time to move b_more_io inodes onto the b_io
> > list and expiring more aged inodes, we only ever do it when the list
> > is empty. That is, it seems to me that this will tend to give
> > b_more_io inodes a smaller share of writeback because they are being
> > moved back to the b_io list less frequently where there are lots of
> > other inodes being dirtied. Have you tested the impact of this
> > change on mixed workload performance? Indeed, can you starve
> > writeback of a large file simply by creating lots of small files in
> > another thread?
> Yeah, this change looks suspicious to me as well.

The exact behaviors are indeed rather complex. Personally, I find the
new "always refill iff empty" policy more consistent, cleaner and easier
to understand.

It basically says: at each round started by a b_io refill, set up a
_fixed_ work set with all currently expired inodes (or all currently
dirty inodes if none is expired) and walk through it. A "fixed" work set
means no new inodes will be added to the work set during the walk. When
a complete walk is done, start over with a new set of inodes that are
eligible at that time.

The figure on page 14 illustrates the "rounds" idea:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/linux-writeback-queues.pdf

This procedure provides fairness among the inodes and guarantees that
each inode is synced once and only once in each round. So it's free
from starvation.
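
The "rounds" behavior can be sketched with a small Python simulation.
This is a toy model under stated assumptions, not the kernel logic: the
function name, the page counts, the per-visit chunk size and the order
in which b_more_io and b_dirty are merged at refill time are all made up
for the illustration. It models one large file that needs several visits
while small files keep getting dirtied during the walk.

```python
from collections import deque


def simulate_rounds(dirty_pages, arrivals, chunk, max_rounds):
    """Simulate "refill b_io iff empty".

    dirty_pages: dict of inode name -> dirty pages at the start.
    arrivals: dict of step index -> list of (name, pages) dirtied then.
    chunk: pages written per inode visit.
    Returns a list of rounds, each the list of inode names synced.
    """
    remaining = dict(dirty_pages)
    b_dirty = deque(dirty_pages)  # dirty inodes, oldest first
    b_more_io = deque()           # inodes with pages left after a visit
    rounds = []
    step = 0
    for _ in range(max_rounds):
        # Refill happens only here, when b_io is empty: fixed work set.
        b_io = deque([*b_more_io, *b_dirty])
        b_more_io.clear()
        b_dirty.clear()
        if not b_io:
            break
        synced = []
        while b_io:
            name = b_io.popleft()
            remaining[name] -= min(chunk, remaining[name])
            synced.append(name)
            if remaining[name] > 0:
                b_more_io.append(name)
            # Inodes dirtied during the walk wait for the next round.
            for new_name, pages in arrivals.get(step, []):
                remaining[new_name] = pages
                b_dirty.append(new_name)
            step += 1
        rounds.append(synced)
    return rounds


rounds = simulate_rounds(
    {"big": 10, "s1": 1},
    {0: [("s2", 1)], 1: [("s3", 1)], 2: [("s4", 1)]},
    chunk=4,
    max_rounds=10,
)
print(rounds)
```

In this model the large file is visited exactly once in every round
until it is clean, no matter how many small files arrive in between.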

If you are worried about performance, here is a simple tar+dd benchmark.
Both commands are actually running faster with this patchset:

wfg /tmp% g cpu log-* | g dd
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 9% cpu 13.658 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 9% cpu 12.961 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 9% cpu 13.420 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.30s system 9% cpu 13.103 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.31s system 9% cpu 13.650 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.25s system 8% cpu 15.258 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 8% cpu 14.255 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 8% cpu 14.443 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.25s system 8% cpu 14.051 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.27s system 8% cpu 14.648 total

wfg /tmp% g cpu log-* | g tar
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.49s user 3.99s system 60% cpu 27.285 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.78s user 4.40s system 65% cpu 26.125 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.50s user 4.56s system 64% cpu 26.265 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.50s user 4.18s system 62% cpu 26.766 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.60s user 4.03s system 60% cpu 27.463 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.42s user 4.17s system 57% cpu 28.688 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.67s user 4.04s system 58% cpu 28.738 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.53s user 4.50s system 58% cpu 29.287 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.38s user 4.28s system 57% cpu 28.861 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.44s user 4.19s system 56% cpu 29.443 total

The /proc/uptime difference (second field, from tar/dd start to sync
complete, last run of each series) is 244.36s without the patch vs.
239.91s with it, also a bit lower with the patch.
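
For a quick summary of the runs above, the "total" times can be averaged
per configuration. The Python below is just arithmetic over the numbers
listed in this mail; the variable and function names are chosen for the
illustration.

```python
from statistics import mean

# "total" wall-clock seconds copied from the log lines above
dd_patched = [13.658, 12.961, 13.420, 13.103, 13.650]   # log-moving-expire
dd_base = [15.258, 14.255, 14.443, 14.051, 14.648]      # log-no-moving-expire
tar_patched = [27.285, 26.125, 26.265, 26.766, 27.463]  # log-moving-expire
tar_base = [28.688, 28.738, 29.287, 28.861, 29.443]     # log-no-moving-expire


def speedup_percent(base, patched):
    """Relative reduction of the mean run time, in percent."""
    return 100.0 * (mean(base) - mean(patched)) / mean(base)


print(f"dd:  {mean(dd_base):.2f}s -> {mean(dd_patched):.2f}s "
      f"({speedup_percent(dd_base, dd_patched):.1f}% faster)")
print(f"tar: {mean(tar_base):.2f}s -> {mean(tar_patched):.2f}s "
      f"({speedup_percent(tar_base, tar_patched):.1f}% faster)")
```

On these five runs each, both commands come out roughly 8% faster on
average with the patchset.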

The base kernel is 2.6.39-rc3+ plus the IO-less patchset and a large
write chunk size. The test box has 3GB of memory and runs XFS. The test
script is:

#!/bin/zsh


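# stage the tarball in tmpfs so that the test does pure writes to the disk
cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/

# start from a freshly made filesystem
umount /dev/sda7
mkfs.xfs -f /dev/sda7
mount /dev/sda7 /fs

# drop clean page cache, dentries and inodes
echo 3 > /proc/sys/vm/drop_caches

# enable the writeback_single_inode trace event
echo 1 > /debug/tracing/events/writeback/writeback_single_inode/enable

# sample before the run
cat /proc/uptime

# run tar and dd concurrently
cd /fs
time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &

# wait for both, flush everything, then sample again
wait
sync
cat /proc/uptime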
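
The two `cat /proc/uptime` samples bracket each run. According to
proc(5), the first field is the system uptime in seconds and the second
is the time spent idle. A small Python sketch for subtracting two
samples follows; the helper names are made up for the illustration and
the sample lines are the pair printed by the last patched run in the
logs attached to this mail.

```python
def parse_uptime(line):
    """Split a /proc/uptime line into (uptime_seconds, idle_seconds)."""
    uptime, idle = line.split()
    return float(uptime), float(idle)


def uptime_delta(before, after):
    """Return (uptime delta, idle delta) between two samples, in seconds."""
    up0, idle0 = parse_uptime(before)
    up1, idle1 = parse_uptime(after)
    return round(up1 - up0, 2), round(idle1 - idle0, 2)


wall, idle = uptime_delta("994.01 7677.81", "1030.08 7917.72")
print(f"uptime delta: {wall:.2f}s, idle delta: {idle:.2f}s")
```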

Thanks,
Fengguang

dt7, no moving target

wfg ~% s fat [ 255 ] :-(
Linux fat 2.6.39-rc3-dt7+ #235 SMP Tue Apr 19 19:33:15 CST 2011 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
No mail.
Last login: Tue Apr 19 19:16:05 2011 from 10.255.20.73
wfg@fat ~% su
root@fat /home/wfg# for i in 1 2 3 4 5; do bin/test-tar-dd.sh; sleep 3; done
umount: /dev/sda7: not mounted
meta-data=/dev/sda7 isize=256 agcount=4, agsize=6170464 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=24681856, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=12051, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
306.70 2423.01
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 15.2306 s, 68.8 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.25s system 8% cpu 15.258 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.42s user 4.17s system 57% cpu 28.688 total
344.05 2662.47
meta-data=/dev/sda7 isize=256 agcount=4, agsize=6170464 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=24681856, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=12051, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
351.63 2721.77
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 14.1873 s, 73.9 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 8% cpu 14.255 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.67s user 4.04s system 58% cpu 28.738 total
388.94 2963.14
meta-data=/dev/sda7 isize=256 agcount=4, agsize=6170464 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=24681856, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=12051, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
396.53 3024.20
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 14.385 s, 72.9 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 8% cpu 14.443 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.53s user 4.50s system 58% cpu 29.287 total
434.18 3268.86
meta-data=/dev/sda7 isize=256 agcount=4, agsize=6170464 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=24681856, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=12051, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
441.69 3327.58
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.997 s, 74.9 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.25s system 8% cpu 14.051 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.38s user 4.28s system 57% cpu 28.861 total
478.91 3569.24
meta-data=/dev/sda7 isize=256 agcount=4, agsize=6170464 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=24681856, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=12051, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
486.48 3627.06
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 14.5851 s, 71.9 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.27s system 8% cpu 14.648 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.44s user 4.19s system 56% cpu 29.443 total
524.46 3871.42

3871.42 - 3627.06 = 244.36


ext4:

1855.48 14403.91
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.48s user 3.31s system 86% cpu 18.345 total
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.4943 s, 51.2 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.65s system 8% cpu 20.518 total
1884.20 14562.35

14562.35 - 14403.91 = 158.44

dt7, moving target

wfg ~% s fat [ 255 ] :-(
Linux fat 2.6.39-rc3-dt7+ #234 SMP Tue Apr 19 17:23:44 CST 2011 x86_64

Last login: Tue Apr 19 17:25:16 2011 from 10.255.20.73
wfg@fat ~% su
root@fat /home/wfg# vi bin/test-tar-dd.sh
root@fat /home/wfg# bin/test-tar-dd.sh
umount: /dev/sda7: not mounted
meta-data=/dev/sda7 isize=256 agcount=4, agsize=6170464 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=24681856, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=12051, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
634.16 5029.23
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.6318 s, 76.9 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 9% cpu 13.658 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.49s user 3.99s system 60% cpu 27.285 total
670.17 5262.84
root@fat /home/wfg# bin/test-tar-dd.sh
meta-data=/dev/sda7 isize=256 agcount=4, agsize=6170464 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=24681856, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=12051, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
678.41 5327.07
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 12.9063 s, 81.2 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 9% cpu 12.961 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.78s user 4.40s system 65% cpu 26.125 total
713.93 5559.64
root@fat /home/wfg# bin/test-tar-dd.sh
meta-data=/dev/sda7 isize=256 agcount=4, agsize=6170464 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=24681856, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=12051, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
722.54 5626.94
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.3658 s, 78.5 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 9% cpu 13.420 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.50s user 4.56s system 64% cpu 26.265 total
757.98 5855.34
root@fat /home/wfg# bin/test-tar-dd.sh
meta-data=/dev/sda7 isize=256 agcount=4, agsize=6170464 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=24681856, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=12051, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
766.10 5918.93
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.0385 s, 80.4 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.30s system 9% cpu 13.103 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.50s user 4.18s system 62% cpu 26.766 total
801.72 6152.51
root@fat /home/wfg#
root@fat /home/wfg# bin/test-tar-dd.sh
meta-data=/dev/sda7 isize=256 agcount=4, agsize=6170464 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=24681856, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=12051, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
994.01 7677.81
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.5859 s, 77.2 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.31s system 9% cpu 13.650 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.60s user 4.03s system 60% cpu 27.463 total
1030.08 7917.72

7917.72 - 7677.81 = 239.91