Re: [PATCH] [RFC] fadvise: Add _VOLATILE, _ISVOLATILE, and _NONVOLATILE flags

From: Rik van Riel
Date: Tue Nov 22 2011 - 04:38:12 EST


On 11/21/2011 10:33 PM, John Stultz wrote:
> This patch provides new fadvise flags that can be used to mark
> file pages as volatile, which will allow it to be discarded if the
> kernel wants to reclaim memory.
>
> This is useful for userspace to allocate things like caches, and lets
> the kernel destructively (but safely) reclaim them when there's memory
> pressure.
>
> Right now, we can simply throw away pages if they are clean (backed
> by a current on-disk copy). That only happens for anonymous/tmpfs/shmfs
> pages when they're swapped out. This patch lets userspace select
> dirty pages which can be simply thrown away instead of writing them
> to disk first. See the mm/shmem.c for this bit of code. It's
> different from FADV_DONTNEED since the pages are not immediately
> discarded; they are only discarded under pressure.

I've got a few questions:

1) How do you tell userspace some of its data got
discarded?

2) How do you prevent the situation where every
volatile object gets a few pages discarded, making
them all unusable?
(better to throw away an entire object at once)

3) Isn't it too slow for something like Firefox to
create a new tmpfs object for every single throw-away
cache object?

Johannes, Jon and I have looked at an alternative way to
allow the kernel and userspace to cooperate in throwing
out cached data. This alternative way does not touch
the alloc/free fast path at all, but does require some
cooperation at "shrink cache" time.

The idea is quite simple:

1) Every program that we are interested in already has
some kind of main loop where it polls on file descriptors.
It is easy for such programs to add an additional file,
which would be a device or sysfs file that wakes up the
program from its poll/select loop when memory is getting
full to the point that userspace needs to shrink its
caches.

The kernel can be smart here and wake up just one process
at a time, targeting specific NUMA nodes or cgroups. Such
kernel smarts do not require additional userspace changes.

2) When userspace gets such a "please shrink your caches"
event, it can do various things. A program like Firefox
could throw away several cached objects, e.g. uncompressed
images or entire pre-rendered tabs, while a JVM can shrink
its heap size and a database could shrink its internal
cache.

3) After doing that, they could all call the same glibc
function that walks across program-internal free memory
and calls MADV_FREE on all free regions that span
multiple pages, which gives the pages back to the kernel,
without needing to move VMA boundaries. This is relatively
lightweight and allows for the nuking of pages right in
the middle of a heap VMA.

4) In some GUI libraries, like gtk/glib, we could open the
memory pressure device node (or sysfs file) by default,
hooking it up to the glibc function from (3) by default,
which would give all gtk/glib programs the ability to
give free()d memory back to the kernel on request, without
needing to even modify the program.

Program modification would only be needed in order to
free cached objects, etc. The modification of programs
running under those libraries would consist of overriding
the "shrink caches" hook with their own function, which
first does program-specific stuff and then calls the
default hook to take care of the glibc side.
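
As a rough illustration of steps 1) through 4), here is a minimal
C sketch. It is not an implementation of anything that exists, and
every name in it is hypothetical: pressure_fd stands for the
proposed device or sysfs file (no such interface is defined in this
thread), free_list stands in for the allocator's own bookkeeping,
and shrink_hook, default_shrink_hook and app_shrink_hook are
made-up names for the overridable hook. Reading "span multiple
pages" as "at least two whole pages" is an assumption as well.
Linux gained MADV_FREE in 4.5; where it is missing, the sketch
falls back to MADV_DONTNEED, which drops the pages immediately.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE                 /* headers older than Linux 4.5 */
#define MADV_FREE MADV_DONTNEED
#endif

/* Hypothetical record of one free region inside the heap. */
struct free_region {
    void  *start;
    size_t len;
};

/* Hypothetical allocator bookkeeping; a real malloc has its own. */
struct free_region *free_list;
size_t free_count;

/* Step 3: hand the whole pages inside one free region back to the
 * kernel. Returns the number of pages released, or 0 if the region
 * holds fewer than two whole pages or madvise() fails. */
size_t release_free_region(void *start, size_t len)
{
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    assert(page != 0 && (page & (page - 1)) == 0); /* masks need 2^n */

    uintptr_t lo = ((uintptr_t)start + page - 1) & ~(page - 1); /* up */
    uintptr_t hi = ((uintptr_t)start + len) & ~(page - 1);    /* down */
    if (hi <= lo || hi - lo < 2 * page)
        return 0;

    int rc = madvise((void *)lo, hi - lo, MADV_FREE);
    if (rc != 0 && errno == EINVAL)        /* kernel lacks MADV_FREE */
        rc = madvise((void *)lo, hi - lo, MADV_DONTNEED);
    return rc == 0 ? (hi - lo) / page : 0;
}

/* Step 4: default hook, walks all free regions and releases them. */
size_t default_shrink_hook(void)
{
    size_t pages = 0;
    for (size_t i = 0; i < free_count; i++)
        pages += release_free_region(free_list[i].start, free_list[i].len);
    return pages;
}

/* Programs point this at their own function to override the hook. */
size_t (*shrink_hook)(void) = default_shrink_hook;

/* Example override: program-specific work first, then the default. */
int cache_dropped;
size_t app_shrink_hook(void)
{
    cache_dropped = 1;          /* e.g. free decoded images here */
    return default_shrink_hook();
}

/* Steps 1-2: one pass of the main loop. pressure_fd is assumed to
 * become readable when the kernel wants caches shrunk. Returns -1 on
 * error, otherwise the number of pages released (0 on timeout). */
long poll_pressure_once(int pressure_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = pressure_fd, .events = POLLIN };
    char buf[64];

    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    if (n == 0 || !(pfd.revents & POLLIN))
        return 0;
    /* Consume the event so the next poll() blocks again. */
    if (read(pressure_fd, buf, sizeof buf) < 0)
        return -1;
    return (long)shrink_hook();
}
```

A real program would add pressure_fd to the pollfd set its main
loop already uses rather than poll it separately; the separate
call only keeps the sketch short.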

We considered the same approach you are proposing as well, but
we did not come up with satisfactory answers to the questions I
asked above, which is why we came up with this scheme.

Unfortunately we have not gotten around to implementing it yet,
but I'd be happy to work on it with you guys if you are
interested.