Re: large files unnecessary trashing filesystem cache?

From: Andrew Morton
Date: Tue Oct 18 2005 - 23:47:19 EST


Andrew James Wade <andrew.j.wade@xxxxxxxxx> wrote:
>
> Sometimes you want a single file to take up most of the memory; databases
> spring to mind. Perhaps files/processes that take up a large proportion of
> memory should be penalized by preferentially reclaiming their pages, but
> limit the aggressiveness so that they can still take up most of the memory
> if sufficiently persistent (and the rest of the system isn't thrashing).

Yes. Basically any smart heuristic we apply here will have failure modes.
For example, the person whose application does repeated linear reads of the
first 100MB of a 4G file will get very upset.

So any such change really has to be opt-in. Yes, it can be done quite
simply via repeated application of posix_fadvise(). But practically
speaking, it's very hard to get upstream GPL'ed applications changed, let
alone proprietary ones.

An obvious approach would be an LD_PRELOAD thingy which modifies read() and
write(), perhaps controlled via an environment variable. AFAIK nobody has
even attempted this.

For a kernel-based solution you could take a look at my old 2.4-based
O_STREAMING patch. It works OK, but it still needs modification of each
application (or an LD_PRELOAD hook into open()).

A decent kernel implementation would be to add a max_resident_pages to
struct file_struct and to use that to perform drop-behind within read() and
write(). That's a bit of arithmetic and a call to
invalidate_mapping_pages(). The userspace interface to that could be a
linux-specific extension to posix_fadvise() or to fcntl().

But that still requires that all the applications be modified.

So I'd also suggest a new resource limit which, if set, is copied into the
applications's file_structs on open(). So you then write a little wrapper
app which does setrlimit()+exec():

limit-cache-usage -s 1000 my-fave-backup-program <args>

Which will cause every file which my-fave-backup-program reads or writes to
be limited to a maximum pagecache residency of 1000 kbytes.

This facility could trivially be used for a mini-DoS: shooting down other
people's pagecache so their apps run slowly. But you can use fadvise() for
that already.


Or raise a patch against glibc's read() and write() which uses some
environment string to control fadvise-based invalidation. That's pretty
simple.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/