Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

From: Andi Kleen
Date: Tue Feb 15 2005 - 16:49:50 EST


> Making memory migration a subset of page migration is not a general
> solution. It only works for programs that are using memory policy
> to control placement. As I've tried to point out multiple times
> before, most programs that I am aware of use placement based on
> first-touch. When we migrate such programs, we have to respect
> the placement decisions that the program has implicitly made in
> this way.

Sorry, but the only real difference between your API and mbind is that
yours has a pid argument.

I think we are talking past each other; here is a more structured
overview of my thinking on the issue.

Main cases:

- Program is NUMA API aware. Fine. It takes care of its own placement
(a minimal sketch of what that looks like follows this list).
- Program is not aware, but is started with a process policy from
numactl/cpusets/batch scheduler. Already covered in the NUMA API too.
- Program is not aware and hasn't been started with a policy
(or it has one and you change your mind), but you want to change it later.
That's the new case we discuss here.
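
For the first case, here is roughly what "takes care of its own" means
with the existing NUMA API (a minimal, untested sketch; it assumes a
machine with at least two nodes and linking with -lnuma for the
mbind(2) wrapper):

/* Sketch: a NUMA-aware program binding its own memory with mbind(2).
 * Assumes node 1 exists; error handling trimmed for brevity. */
#include <numaif.h>     /* mbind, MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 1 << 20;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	unsigned long nodemask = 1UL << 1;	/* node 1 */
	/* Bind the range before first touch, so the program's implicit
	 * first-touch placement goes where we want it. */
	if (mbind(buf, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0) < 0) {
		perror("mbind");
		return 1;
	}
	((char *)buf)[0] = 1;	/* first touch now allocates on node 1 */
	return 0;
}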

Now, how to change the policy of objects in an already running process.

First, there are some special cases that are already handled or covered
by existing patches:
- tmpfs/hugetlbfs/sysv shm: numactl can handle this by just mapping
the object into a different process and changing the policy there
(see the sketch after this list).
- shared libraries and mmapped files in general: this is a generalization
of the previous point. SteveL's patch is the beginning of handling this,
although it needs some more work (xattrs) to make the policy persistent
under memory pressure.
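
For the tmpfs/shm case, the numactl trick looks roughly like this
(a sketch only; the segment path and node mask are made up):

/* Sketch: a helper process maps a shared object and changes the
 * policy attached to the object itself. */
#include <numaif.h>	/* mbind, MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	int fd = open("/dev/shm/mysegment", O_RDWR);	/* made-up path */
	if (fd < 0) { perror("open"); return 1; }

	struct stat st;
	if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

	void *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	/* tmpfs/shm carries a shared policy on the object, so setting
	 * it through this mapping affects every user of the segment. */
	unsigned long nodemask = 1UL << 0;	/* node 0 */
	if (mbind(p, st.st_size, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0) < 0)
		perror("mbind");

	munmap(p, st.st_size);
	close(fd);
	return 0;
}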

The only case left uncovered is anonymous memory.

You said it would need user-space control, but the main reason for
wanting that seems to be to handle the non-anonymous cases, which
are already covered above.

My thinking is that the simplest way to handle that is to have a call that
just migrates everything. The main reason is that I don't think external
processes should mess with the virtual addresses of another process.
It just feels unclean and has many drawbacks (parsing /proc/*/maps
needs complicated user code, is racy, and makes locking difficult).
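
To be concrete, this is the kind of interface I have in mind. Purely a
sketch -- the name and signature are made up, no such syscall exists:

/* Move every page of the target process from the nodes in from_nodes
 * to the corresponding nodes in to_nodes, preserving relative
 * placement.  The point is that the kernel, not user space, walks the
 * target's address space: it can take the target's mmap_sem and walk
 * the VMA list safely, which a /proc/<pid>/maps parser cannot. */
#include <sys/types.h>

long migrate_process_pages(pid_t pid,
			   unsigned long maxnode,
			   const unsigned long *from_nodes,
			   const unsigned long *to_nodes);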

In kernel space handling full VMs is much easier and safer due to better
locking facilities.

In user space, only the process itself can really handle its own virtual
addresses well, and if it does that, it can use the NUMA API directly anyway.

You argued that it may be costly to walk everything, but I don't see this
as a big problem: walking mempolicies is not very costly, and fork() and
exit() already do exactly this.

The main missing piece for this would be a way to make policies for
files persistent. One way would be to use xattrs like selinux does, but
that may be costly (I'm not sure we want to read xattrs every time
a file is read).
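
If we did go the xattr route, the user-space side could be as simple as
this (a sketch; the attribute name "user.mempolicy" and the policy
string format are made up):

/* Sketch: persist a made-up policy string in a made-up xattr. */
#include <sys/xattr.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *policy = "interleave:0-1";	/* format is hypothetical */

	if (argc < 2) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return 1;
	}
	if (setxattr(argv[1], "user.mempolicy", policy,
		     strlen(policy), 0) < 0) {
		perror("setxattr");
		return 1;
	}
	return 0;
}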

A hackish way to do this that already works would be to mlock one page
of the file to keep the inode pinned. E.g. the batch manager could do
this. That's not very clean, but it would probably work.
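
Concretely, the pinning helper could be as dumb as this (a sketch; it
needs RLIMIT_MEMLOCK headroom or CAP_IPC_LOCK to mlock):

/* Sketch: lock one page of the file so the inode -- and any policy
 * attached to it -- stays pinned for as long as this process lives. */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return 1;
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	long pagesize = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, pagesize, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	if (mlock(p, pagesize) < 0) { perror("mlock"); return 1; }

	pause();	/* hold the lock until killed */
	return 0;
}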

-Andi