Re: [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2

From: Minchan Kim
Date: Wed Jul 27 2011 - 00:32:31 EST


Hi Mel,

On Fri, Jul 22, 2011 at 1:28 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
> Warning: Long post with lots of figures. If you normally drink coffee
> and you don't have a cup, get one or you may end up with a case of
> keyboard face.
>
> Changelog since v1
> o Drop prio-inode patch. There is now a dependency that the flusher
>   threads find these dirty pages quickly.
> o Drop nr_vmscan_throttled counter
> o SetPageReclaim instead of deactivate_page which was wrong
> o Add warning to main filesystems if called from direct reclaim context
> o Add patch to completely disable filesystem writeback from reclaim
>
> Testing from the XFS folk revealed that there is still too much
> I/O from the end of the LRU in kswapd. Previously it was considered
> acceptable by VM people for a small number of pages to be written
> back from reclaim with testing generally showing about 0.3% of pages
> reclaimed were written back (higher if memory was low). The claim that
> writing back a small number of pages is OK has been heavily disputed
> for quite some time, and Dave Chinner explained it well:
>
>        It doesn't have to be a very high number to be a problem. IO
>        is orders of magnitude slower than the CPU time it takes to
>        flush a page, so the cost of making a bad flush decision is
>        very high. And single page writeback from the LRU is almost
>        always a bad flush decision.
>
> To complicate matters, filesystems respond very differently to requests
> from reclaim according to Christoph Hellwig:
>
>        xfs tries to write it back if the requester is kswapd
>        ext4 ignores the request if it's a delayed allocation
>        btrfs ignores the request
>
> As a result, each filesystem has different performance characteristics
> when under memory pressure and many pages are being dirtied. In some
> cases, the request is ignored entirely so the VM cannot depend on the
> IO being dispatched.
>
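Just to make sure I follow the behaviour being described here: each
filesystem's ->writepage looks at the reclaim context and then either
issues the IO or redirties the page and bails out. Something of this
shape is how I read it (my own sketch, not code from any of the
filesystems; example_writepage() and do_real_writepage() are made-up
names, while wbc->for_reclaim, current_is_kswapd(),
redirty_page_for_writepage() and unlock_page() are existing interfaces):

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/* Stands in for the filesystem's real IO path (hypothetical) */
static int do_real_writepage(struct page *page, struct writeback_control *wbc);

static int example_writepage(struct page *page, struct writeback_control *wbc)
{
        if (wbc->for_reclaim && !current_is_kswapd()) {
                /*
                 * Called from direct reclaim: refuse to issue IO from
                 * a deep, unknown stack and leave the page dirty for
                 * the flusher threads to find later.
                 */
                redirty_page_for_writepage(wbc, page);
                unlock_page(page);
                return 0;
        }

        /* kswapd or ordinary writeback: actually issue the IO */
        return do_real_writepage(page, wbc);
}

So whether the IO is ever dispatched depends entirely on which branch
the filesystem chooses, which is the unpredictability you point out.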
> The objective of this series is to reduce writing of filesystem-backed
> pages from reclaim, play nicely with writeback that is already in
> progress and throttle reclaim appropriately when dirty pages are
> encountered. The assumption is that the flushers will always write
> pages faster than if reclaim issues the IO. The new problem is that
> reclaim has very little control over how long before a page in a
> particular zone or container is cleaned which is discussed later. A
> secondary goal is to avoid the problem whereby direct reclaim splices
> two potentially deep call stacks together.
>
> Patch 1 disables writeback of filesystem pages from direct reclaim
>        entirely. Anonymous pages are still written.
>
> Patches 2-4 add warnings to XFS, ext4 and btrfs if called from
>        direct reclaim. With patch 1, this "never happens" and
>        is intended to catch regressions in this logic in the
>        future.
>
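I haven't read patches 2-4 in detail yet, so just to confirm the
intent: the warning is presumably something along these lines (my own
sketch of the shape only, the exact condition in your patches may well
differ; PF_MEMALLOC, current_is_kswapd() and WARN_ON_ONCE() are
existing interfaces):

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/swap.h>
#include <linux/writeback.h>

static inline void warn_on_direct_reclaim_writepage(struct writeback_control *wbc)
{
        /*
         * Direct reclaimers run with PF_MEMALLOC set but are not
         * kswapd. After patch 1 this path should be unreachable, so
         * shout once if it is ever hit again.
         */
        WARN_ON_ONCE(wbc->for_reclaim &&
                     (current->flags & PF_MEMALLOC) &&
                     !current_is_kswapd());
}

i.e. it only asserts the "never happens" case and does not change
behaviour, right?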
> Patch 5 disables writeback of filesystem pages from kswapd unless
>        the priority is raised to the point where kswapd is considered
>        to be in trouble.
>
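To check my understanding of the "in trouble" test: kswapd only starts
writing file pages once the scanning priority has dropped well below
DEF_PRIORITY? A sketch of how I read it (the helper name and threshold
are mine, not taken from the patch; page_is_file_cache() is an
existing helper and DEF_PRIORITY is 12):

#include <linux/types.h>
#include <linux/mm.h>
#include <linux/mm_inline.h>

/* Hypothetical threshold; the value used by the patch may differ */
#define EXAMPLE_WRITEPAGE_PRIORITY      6

static bool kswapd_may_writepage(struct page *page, int priority)
{
        /* Anonymous pages are still written regardless of priority */
        if (!page_is_file_cache(page))
                return true;

        /* Lower priority numbers mean more scanning pressure */
        return priority <= EXAMPLE_WRITEPAGE_PRIORITY;
}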
> Patch 6 throttles reclaimers if too many dirty pages are being
>        encountered and the zones or backing devices are congested.
>
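Is the throttling in patch 6 based on wait_iff_congested()? Something
of this shape is what I would expect (the trigger ratio is my guess,
not what the patch does; wait_iff_congested() and BLK_RW_ASYNC are
existing interfaces):

#include <linux/mmzone.h>
#include <linux/jiffies.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>

static void throttle_if_dirty_and_congested(struct zone *zone,
                                            unsigned long nr_dirty,
                                            unsigned long nr_taken)
{
        /*
         * Hypothetical trigger: most of the pages isolated from the
         * LRU tail were dirty. Sleep briefly, but only if the zone or
         * its backing device really is congested.
         */
        if (nr_taken && nr_dirty >= nr_taken / 2)
                wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
}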
> Patch 7 invalidates dirty pages found at the end of the LRU so they
>        are reclaimed quickly after being written back rather than
>        waiting for a reclaimer to find them.
>
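I assume patch 7 is the "SetPageReclaim instead of deactivate_page"
item from the changelog: tag the dirty page at the tail of the LRU so
that, when the IO completes, end_page_writeback() rotates it back to
where reclaim will find it immediately. Roughly (my sketch, with a
hypothetical helper name; PageDirty(), PageWriteback() and
SetPageReclaim() are existing page-flag helpers):

#include <linux/mm.h>
#include <linux/page-flags.h>

static void hint_reclaim_after_writeback(struct page *page)
{
        /*
         * PG_reclaim makes end_page_writeback() move the page to the
         * tail of the inactive list rather than leaving it to age, so
         * a freshly cleaned page is reclaimed quickly.
         */
        if (PageDirty(page) || PageWriteback(page))
                SetPageReclaim(page);
}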
> Patch 8 disables writeback of filesystem pages from kswapd and
>        depends entirely on the flusher threads for cleaning pages.
>        This is potentially a problem if the flusher threads take a
>        long time to wake or are not discovering the pages we need
>        cleaned. By placing the patch last, it is more likely that
>        bisection can catch this situation if it occurs and the
>        patch can be easily reverted.
>
> I consider this series to be orthogonal to the writeback work but
> it is worth noting that the writeback work affects the viability of
> patch 8 in particular.
>
> I tested this on ext4 and xfs using fs_mark and a micro benchmark
> that does a streaming write to a large mapping (exercises use-once
> LRU logic) followed by streaming writes to a mix of anonymous and
> file-backed mappings. The command line for fs_mark when booted with
> 512M looked something like
>
> ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760
>
> The number of files was adjusted depending on the amount of available
> memory so that the total size of files created was about 3xRAM. For
> multiple threads,
> the -d switch is specified multiple times.
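(If I am reading the options right, -n 150 with -s 10485760 works out
to 150 x 10MB = ~1.5GB of files per iteration, i.e. roughly 3xRAM for
the 512M configuration, which matches the description above.)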
>
> 3 kernels are tested.
>
> vanilla 3.0-rc6
> kswapdwb-v2r5           patches 1-7
> nokswapdwb-v2r5         patches 1-8
>
> The test machine is x86-64 with an older generation of AMD processor
> with 4 cores. The underlying storage was 4 disks configured as RAID-0
> as this was the best configuration of storage I had available. Swap
> is on a separate disk. Dirty ratio was tuned to 40% instead of the
> default of 20%.
>
> Testing was run with and without monitors to both verify that the
> patches were operating as expected and that any performance gain was
> real and not due to interference from monitors.
>
> I've posted the raw reports for each filesystem at
>
> http://www.csn.ul.ie/~mel/postings/reclaim-20110721
>
> Unfortunately, the volume of data is excessive but here is a partial
> summary of what was interesting for XFS.

Could you clarify the notation?
1P: 1 processor?
512M: system memory size?
2X, 4X, 16X: the size of files created during the test?

>
>                                            3.0-rc6 vanilla     kswapdwb-v2r5       nokswapdwb-v2r5
> 512M1P-xfs     Files/s  mean                32.99 ( 0.00%)      35.16 ( 6.18%)      35.08 ( 5.94%)
> 512M1P-xfs     Elapsed Time fsmark         122.54              115.54              115.21
> 512M1P-xfs     Elapsed Time mmap-strm      105.09              104.44              106.12
> 512M-xfs       Files/s  mean                30.50 ( 0.00%)      33.30 ( 8.40%)      34.68 (12.06%)
> 512M-xfs       Elapsed Time fsmark         136.14              124.26              120.33
> 512M-xfs       Elapsed Time mmap-strm      154.68              145.91              138.83
> 512M-2X-xfs    Files/s  mean                28.48 ( 0.00%)      32.90 (13.45%)      32.83 (13.26%)
> 512M-2X-xfs    Elapsed Time fsmark         145.64              128.67              128.67
> 512M-2X-xfs    Elapsed Time mmap-strm      145.92              136.65              137.67
> 512M-4X-xfs    Files/s  mean                29.06 ( 0.00%)      32.82 (11.46%)      33.32 (12.81%)
> 512M-4X-xfs    Elapsed Time fsmark         153.69              136.74              135.11
> 512M-4X-xfs    Elapsed Time mmap-strm      159.47              128.64              132.59
> 512M-16X-xfs   Files/s  mean                48.80 ( 0.00%)      41.80 (-16.77%)     56.61 (13.79%)
> 512M-16X-xfs   Elapsed Time fsmark         161.48              144.61              141.19
> 512M-16X-xfs   Elapsed Time mmap-strm      167.04              150.62              147.83
>



--
Kind regards,
Minchan Kim