Re: [patch 0/5] mm: per-zone dirty limiting

From: Mel Gorman
Date: Wed Aug 03 2011 - 09:18:23 EST


On Tue, Aug 02, 2011 at 02:17:33PM +0200, Johannes Weiner wrote:
> On Fri, Jul 29, 2011 at 12:05:10PM +0100, Mel Gorman wrote:
> > On Tue, Jul 26, 2011 at 08:05:59PM +0200, Johannes Weiner wrote:
> > > > As dd is variable, I'm rerunning the tests to do 4 iterations and
> > > > multiple memory sizes for just xfs and ext4 to see what falls out. It
> > > > should take about 14 hours to complete assuming nothing screws up.
> > >
> > > Awesome, thanks!
> > >
> >
> > While they in fact took about 30 hours to complete, I only got around
> > to packaging them up now. Unfortuantely the tests were incomplete as
> > I needed the machine back for another use but the results that did
> > complete are at http://www.csn.ul.ie/~mel/postings/hnaz-20110729/
> >
> > Look for the comparison.html files such as this one
> >
> > http://www.csn.ul.ie/~mel/postings/hnaz-20110729/global-dhp-512M__writeback-reclaimdirty-ext3/hydra/comparison.html
> >
> > I'm afraid I haven't looked through them in detail.
>
> Mel, thanks a lot for running those tests, you shall be compensated in
> finest brewery goods some time.
>

Sweet.

> Here is an attempt:
>
> global-dhp-512M__writeback-reclaimdirty-xfs
>
> SIMPLE WRITEBACK
> simple-writeback writeback-3.0.0 writeback-3.0.0 3.0.0-lessks
> 3.0.0-vanilla lesskswapd-v3r1 perzonedirty-v1r1 pzdirty-v3r1
> 1 1054.54 ( 0.00%) 386.65 (172.74%) 375.60 (180.76%) 375.88 (180.55%)
> +/- 1.41% 4.56% 3.09% 2.34%
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds) 32.27 29.97 30.65 30.91
> Total Elapsed Time (seconds) 4220.48 1548.84 1504.64 1505.79
>
> MMTests Statistics: vmstat
> Page Ins 720433 392017 317097 343849
> Page Outs 27746435 27673017 27619134 27555437
> Swap Ins 173563 94196 74844 81954
> Swap Outs 115864 100264 86833 70904
> Direct pages scanned 3268014 7515 0 1008
> Kswapd pages scanned 5351371 12045948 7973273 7923387
> Kswapd pages reclaimed 3320848 6498700 6486754 6492607
> Direct pages reclaimed 3267145 7243 0 1008
> Kswapd efficiency 62% 53% 81% 81%
> Kswapd velocity 1267.953 7777.400 5299.123 5261.947
> Direct efficiency 99% 96% 100% 100%
> Direct velocity 774.323 4.852 0.000 0.669
> Percentage direct scans 37% 0% 0% 0%
> Page writes by reclaim 130541 100265 86833 70904
> Page writes file 14677 1 0 0
> Page writes anon 115864 100264 86833 70904
> Page reclaim invalidate 0 3120195 0 0
> Slabs scanned 8448 8448 8576 8448
> Direct inode steals 0 0 0 0
> Kswapd inode steals 1828 1837 2056 1918
> Kswapd skipped wait 0 1 0 0
> Compaction stalls 2 0 0 0
> Compaction success 1 0 0 0
> Compaction failures 1 0 0 0
> Compaction pages moved 0 0 0 0
> Compaction move failure 0 0 0 0
>
> While file writes from reclaim are prevented by both patches on their
> own, perzonedirty decreases the amount of anonymous pages swapped out
> because reclaim is always able to make progress instead of wasting its
> file scan budget on shuffling dirty pages.

Good observation and it's related to the usual problem of balancing
multiple LRU lists and what the consequences can be. I had wondered
if it was worth moving dirty pages that were marked PageReclaim to
a separate LRU list but worried that young clean file pages would be
reclaimed before old anonymous pages as a result.

> With lesskswapd in
> addition, swapping is throttled in reclaim by the ratio of dirty pages
> to isolated pages.
>
> The runtime improvements speak for both perzonedirty and
> perzonedirty+lesskswapd. Given the swap upside and increased reclaim
> efficiency, the combination of both appears to be the most desirable.
>
> global-dhp-512M__writeback-reclaimdirty-ext3
>

Agreed.

> SIMPLE WRITEBACK
> simple-writeback writeback-3.0.0 writeback-3.0.0
> 3.0.0-vanilla lesskswapd-v3r1 perzonedirty-v1r1
> 1 1762.23 ( 0.00%) 987.73 (78.41%) 983.82 (79.12%)
> +/- 4.35% 2.24% 1.56%
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds) 46.36 44.07 46
> Total Elapsed Time (seconds) 7053.28 3956.60 3940.39
>
> MMTests Statistics: vmstat
> Page Ins 965236 661660 629972
> Page Outs 27984332 27922904 27715628
> Swap Ins 231181 158799 137341
> Swap Outs 151395 142150 88644
> Direct pages scanned 2749884 11138 1315072
> Kswapd pages scanned 6340921 12591169 6599999
> Kswapd pages reclaimed 3915635 6576549 5264406
> Direct pages reclaimed 2749002 10877 1314842
> Kswapd efficiency 61% 52% 79%
> Kswapd velocity 899.003 3182.320 1674.961
> Direct efficiency 99% 97% 99%
> Direct velocity 389.873 2.815 333.742
> Percentage direct scans 30% 0% 16%
> Page writes by reclaim 620698 142155 88645
> Page writes file 469303 5 1
> Page writes anon 151395 142150 88644
> Page reclaim invalidate 0 3717819 0
> Slabs scanned 8704 8576 33408
> Direct inode steals 0 0 466
> Kswapd inode steals 1872 2107 2115
> Kswapd skipped wait 0 1 0
> Compaction stalls 2 0 1
> Compaction success 1 0 0
> Compaction failures 1 0 1
> Compaction pages moved 0 0 0
> Compaction move failure 0 0 0
>
> perzonedirty the highest reclaim efficiencies, the lowest writeout
> counts from reclaim, and the shortest runtime.
>
> While file writes are practically gone with both lesskswapd and
> perzonedirty on their own, the latter also reduces swapping by 40%.
>

Similar observation as before - fewer anonymous pages are being
reclaimed. This should also have a positive effect when writing to a USB
stick and avoiding distruption of running applications.

I do note that there were a large number of pages direct reclaimed
though. It'd be worth keeping an eye on stall times there be it due to
congestion or similar due to page allocator latency.

> I expect the combination of both series to have the best results here
> as well.
>

Quite likely. I regret the combination tests did not have a chance to
run but I'm sure there will be more than one revision.

> global-dhp-512M__writeback-reclaimdirty-ext4
>
> SIMPLE WRITEBACK
> simple-writeback writeback-3.0.0 writeback-3.0.0
> 3.0.0-vanilla lesskswapd-v3r1 perzonedirty-v1r1
> 1 405.42 ( 0.00%) 410.48 (-1.23%) 401.77 ( 0.91%)
> +/- 3.62% 4.45% 2.82%
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds) 31.25 31.4 31.37
> Total Elapsed Time (seconds) 1624.60 1644.56 1609.67
>
> MMTests Statistics: vmstat
> Page Ins 354364 403612 332812
> Page Outs 27607792 27709096 27536412
> Swap Ins 84065 96398 79219
> Swap Outs 83096 108478 65342
> Direct pages scanned 112 0 56
> Kswapd pages scanned 12207898 12063862 7615377
> Kswapd pages reclaimed 6492490 6504947 6486946
> Direct pages reclaimed 112 0 56
> Kswapd efficiency 53% 53% 85%
> Kswapd velocity 7514.402 7335.617 4731.018
> Direct efficiency 100% 100% 100%
> Direct velocity 0.069 0.000 0.035
> Percentage direct scans 0% 0% 0%
> Page writes by reclaim 3076760 108483 65342
> Page writes file 2993664 5 0
> Page writes anon 83096 108478 65342
> Page reclaim invalidate 0 3291697 0
> Slabs scanned 8448 8448 8448
> Direct inode steals 0 0 0
> Kswapd inode steals 1979 1993 1945
> Kswapd skipped wait 1 0 0
> Compaction stalls 0 0 0
> Compaction success 0 0 0
> Compaction failures 0 0 0
> Compaction pages moved 0 0 0
> Compaction move failure 0 0 0
>
> With lesskswapd, both runtime and swapouts increased. My only guess
> is that in this configuration, the writepage calls actually improve
> things to a certain extent.
>

A possible explanation is that file pages are being skipped but still
accounted for as scanned. shrink_zone is called() more as a result and
the anonymous lists are being shrunk more relate to the file lists.
One way to test the theory would be to not count dirty pages marked
PageReclaim as scanned.

> Otherwise, nothing stands out to me here, and the same as above
> applies wrt runtime and reclaim efficiency being the best with
> perzonedirty.
>
> global-dhp-1024M__writeback-reclaimdirty-ext3
>
> SIMPLE WRITEBACK
> simple-writeback writeback-3.0.0 writeback-3.0.0
> 3.0.0-vanilla lesskswapd-v3r1 perzonedirty-v1r1
> 1 1291.74 ( 0.00%) 1034.56 (24.86%) 1023.04 (26.26%)
> +/- 2.77% 1.98% 4.42%
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds) 42.41 41.97 43.49
> Total Elapsed Time (seconds) 5176.73 4142.26 4096.57
>
> MMTests Statistics: vmstat
> Page Ins 27856 24392 23292
> Page Outs 27360416 27352736 27352700
> Swap Ins 1 6 0
> Swap Outs 2 39 32
> Direct pages scanned 5899 0 0
> Kswapd pages scanned 6500396 7948564 6014854
> Kswapd pages reclaimed 6008477 6012586 6013794
> Direct pages reclaimed 5899 0 0
> Kswapd efficiency 92% 75% 99%
> Kswapd velocity 1255.695 1918.895 1468.266
> Direct efficiency 100% 100% 100%
> Direct velocity 1.140 0.000 0.000
> Percentage direct scans 0% 0% 0%
> Page writes by reclaim 181091 39 32
> Page writes file 181089 0 0
> Page writes anon 2 39 32
> Page reclaim invalidate 0 1843189 0
> Slabs scanned 3840 3840 4096
> Direct inode steals 0 0 0
> Kswapd inode steals 0 0 0
> Kswapd skipped wait 0 0 0
> Compaction stalls 0 0 0
> Compaction success 0 0 0
> Compaction failures 0 0 0
> Compaction pages moved 0 0 0
> Compaction move failure 0 0 0
>
> Writes from reclaim are reduced to practically nothing by both
> patchsets, but perzonedirty standalone wins in runtime and reclaim
> efficiency.
>

Yep, the figures do support the patchset being brought to completion
assuming the issues like lowmem pressure and any risk assocated with
using wakeup_flusher_threads can be ironed out.

> global-dhp-1024M__writeback-reclaimdirty-ext4
>
> <SNIP, looks good>
>
> global-dhp-4608M__writeback-reclaimdirty-ext3
>
> SIMPLE WRITEBACK
> simple-writeback writeback-3.0.0 writeback-3.0.0
> 3.0.0-vanilla lesskswapd-v3r1 perzonedirty-v1r1
> 1 1274.37 ( 0.00%) 1204.00 ( 5.84%) 1317.79 (-3.29%)
> +/- 2.02% 2.03% 3.05%
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds) 43.93 44.4 45.85
> Total Elapsed Time (seconds) 5130.22 4824.17 5278.84
>
> MMTests Statistics: vmstat
> Page Ins 44004 43704 44492
> Page Outs 27391592 27386240 27390108
> Swap Ins 6968 5855 6091
> Swap Outs 8846 8024 8065
> Direct pages scanned 0 0 115384
> Kswapd pages scanned 4234168 4656846 4105795
> Kswapd pages reclaimed 3899101 3893500 3776056
> Direct pages reclaimed 0 0 115347
> Kswapd efficiency 92% 83% 91%
> Kswapd velocity 825.338 965.315 777.784
> Direct efficiency 100% 100% 99%
> Direct velocity 0.000 0.000 21.858
> Percentage direct scans 0% 0% 2%
> Page writes by reclaim 42555 8024 40622
> Page writes file 33709 0 32557
> Page writes anon 8846 8024 8065
> Page reclaim invalidate 0 586463 0
> Slabs scanned 3712 3840 3840
> Direct inode steals 0 0 0
> Kswapd inode steals 0 0 0
> Kswapd skipped wait 0 0 0
> Compaction stalls 0 0 0
> Compaction success 0 0 0
> Compaction failures 0 0 0
> Compaction pages moved 0 0 0
> Compaction move failure 0 0 0
>
> Here, perzonedirty fails to ensure enough clean pages in what I guess
> is a small Normal zone on top of the DMA32 zone. The
> (not-yet-optimized) per-zone dirty checks cost CPU time but they do
> not pay off and dirty pages are still encountered by reclaim.
>
> Mel, can you say how big exactly the Normal zone is with this setup?
>

Normal zone == 129280 pages == 505M. DMA32 is 701976 pages or
2742M. Not small enough to cause the worse of problems related to a
smallest upper zone admittedly but enough to cause a lot of direct
reclaim activity with plenty of writing files back.

> My theory is that the closer (file_pages - dirty_pages) is to the high
> watermark which kswapd tries to balance to, the more likely it is to
> run into dirty pages. And to my knowledge, these tests are run with a
> non-standard 40% dirty ratio, which lowers the threshold at which
> perzonedirty falls apart. Per-zone dirty limits should probably take
> the high watermark into account.
>

That would appear sensible. The choice of 40% dirty ratio is deliberate.
My understanding is a number of servers that are IO intensive will have
dirty ratio tuned to this value. On bug reports I've seen for distro
kernels related to IO slowdowns, it seemed to be a common choice. I
suspect it's tuned to this because it used to be the old default. Of
course, 40% also made the writeback problem worse so the effect of the
patches is easier to see.

> This does not explain the regression to me, however, if the Normal
> zone here is about the same size as the DMA32 zone in the 512M tests
> above, for which perzonedirty was an unambiguous improvement.
>

The Normal zone is not the same size as DMA32 so scratch that.

Note that the slowdown here is small. The vanilla kernel is finishes
in 1274.37 +/ 2.04%. Your patches result are 1317.79 +/ 3.05% so there
is some overlap. kswapd is less aggressive and direct reclaim is used
more which might be sufficient to explain the slowdown. An avenue of
investigation is why kswapd is reclaiming so much less. It can't be
just the use of writepage or the vanilla kernel would show similar
scan and reclaim rates.

> What makes me wonder, is that in addition, something in perzonedirty
> makes kswapd less efficient in the 4G tests, which is the opposite
> effect it had in all other setups. This increases direct reclaim
> invocations against the preferred Normal zone. The higher pressure
> could also explain why reclaim rushes through the clean pages and runs
> into dirty pages quicker.
>
> Does anyone have a theory about what might be going on here?
>

This is tenuous at best and I confess I have not thought deeply
about it but it could be due to the relative age of the pages in the
highest zone.

In the vanilla kernel, the Normal zone gets filled with dirty pages
first and then the lower zones get used up until dirty ratio when
flusher threads get woken. Because the highest zone also has the
oldest pages and presumably the oldest inodes, the zone gets fully
cleaned by the flusher. The pattern is "fill zone with dirty pages,
use lower zones, highest zone gets fully cleaned reclaimed and refilled
with dirty pages, repeat"

In the patched kernel, lower zones are used when the dirty limits of a
zone are met and the flusher threads are woken to clean a small number
of pages but not the full zone. Reclaim takes the clean pages and they
get replaced with younger dirty pages. Over time, the highest zone
becomes a mix of old and young dirty pages. The flusher threads run
but instead of cleaning the highest zone first, it is cleaning a mix
of pages both all the zones. If this was the case, kswapd would end
up writing more pages from the higher zone and stalling as a result.

A further problem could be that direct reclaimers are hitting that new
congestion_wait(). Unfortunately, I was not running with stats enabled
to see what the congestion figures looked like.

> The tests with other filesystems on 4G memory look similarly bleak for
> perzonedirty:
>
> global-dhp-4608M__writeback-reclaimdirty-ext4
>
> <SNIP>
>
> I am doubly confused because I ran similar tests with 4G memory and
> got contradicting results. Will rerun those to make sure.
>
> Comments?

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/