Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable

From: Mel Gorman
Date: Tue Dec 17 2013 - 16:22:27 EST


On Tue, Dec 17, 2013 at 12:43:02PM -0500, Johannes Weiner wrote:
> > > > When looking at this closer I found that sysv is a weird exception. It's
> > > > file-backed as far as most of the VM is concerned but looks anonymous to
> > > > most applications that care. That and MAP_SHARED anonymous pages should
> > > > not be treated like files but we still want tmpfs to be treated as
> > > > files. Details will be in the changelog of the next series.
> > >
> > > In what sense is it seen as file-backed?
> >
> > sysv and anonymous pages are backed by an internal shmem mount point. In
> > lots of respects, it looks like a file and quacks like a file but I expect
> > developers think of it as being anonymous and chunks of the VM treat it like
> > it's anonymous. tmpfs uses the same paths and they get treated by the VM
> > similarly to anon but users may think that tmpfs should be subject to the
> > fair allocation zone policy "because they're files." It's a sufficiently
> > weird case that any action we take there should be deliberate. It'll be
> > a bit clearer when I post the patch that special cases this.
>
> The line I see here is mostly derived from performance expectations.
>
> People and programs expect anon, shmem/tmpfs etc. to be fast and avoid
> their reclaim at great costs, so they size this part of their workload
> according to memory size and locality. Filesystem cache (on-disk) on
> the other hand is expected to be slow on the first fault and after it
> has been displaced by other data, but the kernel is mostly expected to
> maximize the caching effects in a predictable manner.
>

Part of their performance expectations is that memory referenced from the
local node will be allocated locally. Consider NUMA-aware applications that
partition their data usage appropriately and share that data between threads
using processes and shared memory (as some MPI implementations do). They have
an expectation that the memory will be local and a further expectation
that it will not be reclaimed because they sized it appropriately.
Automatically interleaving such memory by default will be surprising to
NUMA-aware applications even if NUMA-oblivious applications benefit.
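
To put a rough number on that surprise, here is a toy model rather than
anything resembling the allocator itself. It assumes one task pinned to each
node, each task sized to fit its node so nothing is reclaimed, and compares
node-local placement with a shared round-robin cursor. The names
place_pages and local_fraction are made up for this illustration.

```python
from collections import Counter

def place_pages(num_nodes, pages_per_task, policy):
    """Return one Counter per task mapping node id -> pages placed there.

    Toy assumptions: one task runs on each node (task i is "home" on
    node i) and capacity is never exhausted, so reclaim is not modelled.
    """
    placement = []
    next_node = 0  # shared round-robin cursor, used by "interleave" only
    for home in range(num_nodes):
        counts = Counter()
        for _ in range(pages_per_task):
            if policy == "local":
                node = home
            elif policy == "interleave":
                node = next_node
                next_node = (next_node + 1) % num_nodes
            else:
                raise ValueError(policy)
            counts[node] += 1
        placement.append(counts)
    return placement

def local_fraction(placement):
    """Fraction of all pages that ended up on the allocating task's node."""
    total = sum(sum(c.values()) for c in placement)
    local = sum(c[home] for home, c in enumerate(placement))
    return local / total
```

In this model a two-node machine keeps every page local under the "local"
policy and only half of them under "interleave". An application that
partitioned its shared memory per node would see roughly half of its
accesses go remote without having changed anything itself.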

Similarly, the pagecache sysctl is documented to affect files, at least
that's how I wrote it. It's inconsistent to explain that as "the sysctl
controls files, except for tmpfs ones because ... whatever".

> The round-robin policy makes the displacement predictable (think of
> the aging artifacts here where random pages do not get displaced
> reliably because they ended up on remote nodes) and it avoids IO by
> maximizing memory utilization.
>
> I.e. it improves behavior associated with a cache, but I don't expect
> shmem/tmpfs to be typically used as a disk cache. I could be wrong
> about that, but I figure if you need named shared memory that is
> bigger than your memory capacity (the point where your tmpfs would
> actually turn into a disk cache), you'd be better off using a more
> efficient on-disk filesystem.
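
For reference, the aging artifact described in the quoted text can be
reproduced in a deliberately simplified model. The sketch below assumes each
zone is a plain FIFO list, that pages are streamed through the cache once in
order, and that under the "local_first" policy reclaim only ever happens in
zone 0 once every zone is full. Real reclaim with kswapd and watermarks is
more involved than this, and the function name simulate is hypothetical.

```python
from collections import deque

def simulate(policy, zone_capacity, num_zones, num_pages):
    """Stream page ids 0..num_pages-1 through a toy cache.

    Returns the sorted list of page ids still resident at the end.
    """
    zones = [deque() for _ in range(num_zones)]  # oldest page at the left
    cursor = 0  # round-robin position, used by "round_robin" only
    for page in range(num_pages):
        if policy == "round_robin":
            target = zones[cursor]
            cursor = (cursor + 1) % num_zones
        elif policy == "local_first":
            # Use the first zone with free space. Once all zones are
            # full, assume only zone 0 (the local one) is reclaimed.
            target = next((z for z in zones if len(z) < zone_capacity),
                          zones[0])
        else:
            raise ValueError(policy)
        if len(target) >= zone_capacity:
            target.popleft()  # evict the oldest page in that zone
        target.append(page)
    return sorted(p for z in zones for p in z)
```

With two zones of four pages each and twenty pages streamed, the
"round_robin" policy leaves pages 12 to 19 resident, which is what a global
FIFO would keep. The "local_first" policy leaves pages 16 to 19 plus the
stale pages 4 to 7, which sit in the second zone and are never displaced.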

I am concerned with semantics like "all files except tmpfs files" or
alternatively regressing performance of NUMA-aware applications and their
use of MAP_SHARED and sysv.

--
Mel Gorman
SUSE Labs