[RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

From: Andrea Righi
Date: Sat Feb 11 2012 - 19:23:30 EST


There were some reported problems in the past about trashing page cache when a
backup software (i.e., rsync) touches a huge amount of pages (see for example
[1]).

This problem was mitigated by the Minchan's patch [2] and a proper use of
fadvise() in the backup software. For example this patch set [3] has been
proposed for inclusion in rsync.

However, there are still other trashing problems: when the backup software
reads all the source files, some of them may be part of the actual working set
of the system. If a POSIX_FADV_DONTNEED is performed _all_ pages are evicted
from pagecache, both the working set and the use-once pages touched only by the
backup.

A previous proposal [4] was to use the POSIX_FADV_NOREUSE hint, currently
implemented as a no-op, to provide a page invalidation policy less agressive
than POSIX_FADV_DONTNEED (by moving active pages to the tail of the inactive
list, and dropping the pages from the inactive list).

However, as correctly pointed out by KOSAKI, this behavior is very different
respect to the POSIX definition:

POSIX_FADV_NOREUSE
Specifies that the application expects to access the specified data once and then
not reuse it thereafter.

The new proposal is to implement POSIX_FADV_NOREUSE as a way to perform a real
drop-behind policy where applications can mark certain intervals of a file as
FADV_NOREUSE before accessing the data.

In this way the marked pages will never blow away the current working set. This
requirement can be satisfied by preventing lru activation of the FADV_NOREUSE
pages. Moreover, all the file cache pages in a FADV_NOREUSE range will be
immediately dropped after a read if the page was not present in the cache
before. The purpose is to preserve as much as possible the previous state of
the file cache memory before reading data in ranges marked via FADV_NOREUSE.

The list of FADV_NOREUSE ranges are maintained into an interval tree [5] inside
the address_space structure (for this a generic interval tree implementation is
also provided by PATCH 1/2). The intervals are dropped when pages are also
dropped from the page cache or when a different caching hint is specified on
the same intervals.

Example:

How this can solve the "backup is trashing my page cache" issue?

The "backup" software can be changed as following:

- before:
- fd_src = open /path/very_large_file
- fd_dst = open /backup/very_large_file
- read from fd_src
- write to fd_dst
- close fd_src and fd_dst

- after:
- fd_src = open /path/large_file
- fd_dst = open /backup/very_large_file
- posix_fadvise(fd_src, start, end, POSIX_FADV_NOREUSE)
- read from fd_src (even multiple times)
- write to fd_dst
- posix_fadvise(fd_src, start, end, POSIX_FADV_NORMAL)
- posix_fadvise(fd_dst, start, end, POSIX_FADV_DONTNEED)
- close fd_src and fd_dst

Simple test case:

- read a chunk of 256MB from the middle of a large file (this represents the
working set)

- simulate a backup-like software reading the whole file using
POSIX_FADV_NORMAL, or POSIX_FADV_DONTNEED or POSIX_FADV_NOREUSE

- re-read the same 256MB chunk from the middle of the file and measure
the time and the bio submitted (perf stat -e block:block_bio_queue)

The test is done starting both with the working set in the inactive lru list
(the 256MB chunk read once at the beginning of the test) and the working set in
the active lru list (the 256MB chunk read twice at the beginning of the test).

Results:

inact_working_set = the working set is all in the inactive lru list
act_working_set = the working set is all in the active lru list

- block:block_bio_queue (events)
inact_working_set act_working_set
POSIX_FADV_NORMAL 2,052 0
POSIX_FADV_DONTNEED 0(*) 2048
POSIX_FADV_NOREUSE 0 0

- elapsed time (sec)
inact_working_set act_working_set
POSIX_FADV_NORMAL 1.013 0.070
POSIX_FADV_DONTNEED 0.070(*) 1.006
POSIX_FADV_NOREUSE 0.070 0.070

(*) With POSIX_FADV_DONTNEED I would expect to see the working set pages
dropped from the page cache when it starts in the inactive lru list.

Instead this happens only when we start with the working set all in the active
list. IIUC this happens because when we re-read the pages the second time
they're moved to the per-cpu vector activate_page_pvecs; if the page vector is
not yet drained when we issue the POSIX_FADV_DONTNEED the advice is ignored.

Anyway, for this particular test case the best solution to preserve the state
of the page cache is obviouly the usage of POSIX_FADV_NOREUSE.

Regression test:

for i in `seq 1 10`; do
fio --name=fio --directory=. --rw=read --bs=4K --size=1G --numjobs=4 | grep READ:
done

- before:
READ: io=4096.0MB, aggrb=244737KB/s, minb=62652KB/s, maxb=63474KB/s, mint=16916msec, maxt=17138msec
READ: io=4096.0MB, aggrb=242263KB/s, minb=62019KB/s, maxb=64380KB/s, mint=16678msec, maxt=17313msec
READ: io=4096.0MB, aggrb=241913KB/s, minb=61929KB/s, maxb=62766KB/s, mint=17107msec, maxt=17338msec
READ: io=4096.0MB, aggrb=244337KB/s, minb=62550KB/s, maxb=63776KB/s, mint=16836msec, maxt=17166msec
READ: io=4096.0MB, aggrb=242768KB/s, minb=62148KB/s, maxb=62517KB/s, mint=17175msec, maxt=17277msec
READ: io=4096.0MB, aggrb=242796KB/s, minb=62155KB/s, maxb=63191KB/s, mint=16992msec, maxt=17275msec
READ: io=4096.0MB, aggrb=244352KB/s, minb=62554KB/s, maxb=63392KB/s, mint=16938msec, maxt=17165msec
READ: io=4096.0MB, aggrb=242011KB/s, minb=61954KB/s, maxb=62368KB/s, mint=17216msec, maxt=17331msec
READ: io=4096.0MB, aggrb=241676KB/s, minb=61869KB/s, maxb=63738KB/s, mint=16846msec, maxt=17355msec
READ: io=4096.0MB, aggrb=242319KB/s, minb=62033KB/s, maxb=63362KB/s, mint=16946msec, maxt=17309msec

avg aggrb = 242917KB/s, avg mint = 16965msec, avg maxt = 17267msec

- after:
READ: io=4096.0MB, aggrb=243968KB/s, minb=62455KB/s, maxb=63306KB/s, mint=16961msec, maxt=17192msec
READ: io=4096.0MB, aggrb=242979KB/s, minb=62202KB/s, maxb=63127KB/s, mint=17009msec, maxt=17262msec
READ: io=4096.0MB, aggrb=242473KB/s, minb=62073KB/s, maxb=62285KB/s, mint=17239msec, maxt=17298msec
READ: io=4096.0MB, aggrb=244494KB/s, minb=62590KB/s, maxb=63272KB/s, mint=16970msec, maxt=17155msec
READ: io=4096.0MB, aggrb=244352KB/s, minb=62554KB/s, maxb=63269KB/s, mint=16971msec, maxt=17165msec
READ: io=4096.0MB, aggrb=241969KB/s, minb=61944KB/s, maxb=63444KB/s, mint=16924msec, maxt=17334msec
READ: io=4096.0MB, aggrb=243303KB/s, minb=62285KB/s, maxb=62543KB/s, mint=17168msec, maxt=17239msec
READ: io=4096.0MB, aggrb=243232KB/s, minb=62267KB/s, maxb=63109KB/s, mint=17014msec, maxt=17244msec
READ: io=4096.0MB, aggrb=241969KB/s, minb=61944KB/s, maxb=62652KB/s, mint=17138msec, maxt=17334msec
READ: io=4096.0MB, aggrb=241649KB/s, minb=61862KB/s, maxb=62616KB/s, mint=17148msec, maxt=17357msec

avg aggrb = 243038KB/s, avg mint = 17054msec, avg maxt = 17258msec

No obvious performance regression was found according to this simple test.

Credits:
- Some of the routines to implement the generic interval tree has been taken
from the x86 PAT code, that uses interval trees to keep track of PAT ranges
(in the future it would be interesting to convert also the x86 PAT code to
use the generic interval tree implementation).

- The idea to store the FADV_NOREUSE intervals into the address_space
structure as been inspired by the John's POSIX_FADV_VOLATILE patch [6].

References:
[1] http://marc.info/?l=rsync&m=128885034930933&w=2
[2] https://lkml.org/lkml/2011/2/20/57
[3] http://lists.samba.org/archive/rsync/2010-November/025827.html
[4] http://thread.gmane.org/gmane.linux.kernel.mm/65493
[5] http://en.wikipedia.org/wiki/Interval_tree
[6] http://thread.gmane.org/gmane.linux.kernel/1218654

ChangeLog v4 -> v5:
- completely new redesign: implement the expected drop-behind policy
maintaining the list of FADV_NOREUSE ranges inside the file

[PATCH v5 1/3] kinterval: routines to manipulate generic intervals
[PATCH v5 2/3] mm: filemap: introduce mark_page_usedonce
[PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

fs/inode.c | 3 +
include/linux/fs.h | 12 ++
include/linux/kinterval.h | 126 ++++++++++++
include/linux/swap.h | 1 +
lib/Makefile | 2 +-
lib/kinterval.c | 483 +++++++++++++++++++++++++++++++++++++++++++++
mm/fadvise.c | 18 ++-
mm/filemap.c | 95 +++++++++-
mm/swap.c | 24 +++
9 files changed, 760 insertions(+), 4 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/