[RFC] AutoNUMA alpha6

From: Andrea Arcangeli
Date: Fri Mar 16 2012 - 14:26:37 EST


On Fri, Mar 16, 2012 at 03:40:28PM +0100, Peter Zijlstra wrote:
> And a few numbers...

Could you try the two trivial benchmarks I sent on lkml too? That
should take less time than the effort you spent adding those
performance numbers to perf. I use those benchmarks as a regression
test for my code, and they exercise a more complex scenario than
"sleep 2", so supposedly the results will be more interesting.

You can find both programs at this link:

http://lists.openwall.net/linux-kernel/2012/01/27/9

These are my results.

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120126.pdf

I happened to release the autonuma source yesterday on my git
tree:

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=refs/heads/autonuma

git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=30ed50adf6cfe85f7feb12c4279359ec52f5f2cd;hp=c03cf0621ed5941f7a9c1e0a343d4df30dbfb7a1

It's a big monolithic patch, but I'll split it.

THP native migration isn't complete yet, so THP autonuma degrades
more than it should when compared to hard-bound THP. But that can be
added later, and it'll benefit move_pages() and any other userland
hard binding too, not just the autonuma kernel side. I guess you
need this feature too.

The scanning rate still needs tuning; the default is possibly too
fast because all my benchmarks tend to be short-lived. There's
already plenty of tuning available in /sys/kernel/mm/autonuma.

There's lots of other tuning to facilitate testing the different
algorithms. By default the NUMA balancing decisions will keep a
process stuck in its own node unless there's an idle CPU elsewhere,
but there's a mode that lets it escape the node for load
balancing/fairness reasons (to be closer to the stock scheduler):
set load_balance_strict to zero (the default is 1).

There's also a knuma_scand/working_set tweak to scan only the
working set and not all memory (so we only care about what's hot: if
the app has a ton of memory on some node that it isn't using, that
memory is no longer accounted in the memory migration and CPU
migration decisions).
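
For reference, flipping these knobs from a test harness looks
roughly like this. The knob names (load_balance_strict,
knuma_scand/working_set) are the ones mentioned above, and I'm
assuming they sit directly under /sys/kernel/mm/autonuma/; the
helper program itself is only a sketch, not something shipped with
the patch:

#include <stdio.h>

/* sketch: write a value into an AutoNUMA sysfs knob */
static int autonuma_set(const char *knob, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/mm/autonuma/%s", knob);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* let processes escape their node for fairness (default is 1) */
	autonuma_set("load_balance_strict", "0");
	/* scan only the working set, not all memory */
	autonuma_set("knuma_scand/working_set", "1");
	return 0;
}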

There's no syscall or hint userland can give.

The migration doesn't happen during the page fault. There's a proper
knuma_migrated daemon per node, and each daemon has a per-node array
of page lists. knuma_migrated0 is woken with some hysteresis and
picks the pages that want to go from node1 to node0, from node2 to
node0, etc., and it picks them in round-robin fashion across all
source nodes. That stops when node0 is out of memory and the cache
would have to be shrunk, or in most cases when there are no more
pages to migrate. One of the missing features is to start balancing
the cache around too, but I'll add that later; I've already reserved
one slot in the pgdat for it. All the other knuma_migratedN daemons
run as well, so we're guaranteed to make progress when process A's
memory is going from node0 to node1 and process B's is going from
node1 to node0.
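
To make the structure concrete, here's a simplified sketch of that
per-node daemon's drain loop. All identifiers here are invented for
illustration (the real implementation is mm/autonuma.c in the
diffstat below):

#include <linux/list.h>
#include <linux/mm_types.h>	/* struct page */
#include <linux/nodemask.h>	/* MAX_NUMNODES */
#include <linux/spinlock.h>

struct knuma_migrated_queue {
	spinlock_t lock;
	/* migrate_list[src] holds the pages that want to leave node
	 * "src" and land on this daemon's node */
	struct list_head migrate_list[MAX_NUMNODES];
};

static void knuma_migrated_drain(int dst_nid, struct knuma_migrated_queue *q)
{
	int src;

	/* round-robin across the source nodes so none of them starves */
	for (src = 0; src < MAX_NUMNODES; src++) {
		struct page *page = NULL;

		spin_lock(&q->lock);
		if (!list_empty(&q->migrate_list[src])) {
			page = list_first_entry(&q->migrate_list[src],
						struct page, lru);
			list_del(&page->lru);
		}
		spin_unlock(&q->lock);

		if (page)
			/* invented helper standing in for the real
			 * migration call */
			migrate_page_to_node(page, dst_nid);
		/* the real daemon also bails out early if dst_nid runs
		 * out of memory, and sleeps until woken with hysteresis
		 * when all lists are empty */
	}
}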

All memory that isn't shared is migrated; that includes mapped
pagecache.

The basic logic is the scheduler following the memory and the memory
following the CPU, until things converge.
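
The "scheduler follows memory" half can be sketched like this; the
per-task counter and the helper are invented names for illustration
(the real decisions live in kernel/sched/numa.c):

#include <linux/nodemask.h>	/* for_each_online_node */
#include <linux/sched.h>

static int task_preferred_node(struct task_struct *p)
{
	int nid, best_nid = -1;
	unsigned long best_pages = 0;

	/* prefer the node holding most of the task's memory; the
	 * scheduler then tries to keep the task on CPUs of that node
	 * (numa_pages_on_node is an invented per-task counter) */
	for_each_online_node(nid) {
		if (p->numa_pages_on_node[nid] > best_pages) {
			best_pages = p->numa_pages_on_node[nid];
			best_nid = nid;
		}
	}
	return best_nid;
}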

I'm skeptical in general that any NUMA hinting syscall will be used
by anything except qemu, and that's what motivated my design.
Hopefully in the future CPU vendors will provide us with a better
way to track memory locality than what I'm doing right now in
software. The cost is almost unmeasurable (even if you disable the
pmd mode). I'm afraid with virt the cost could be higher because of
the vmexits, but virt is long-lived and a slower scanning rate for
the memory layout information should be OK.
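
The "memory follows CPU" half of the software tracking boils down to
knuma_scand periodically making the ptes fault, with the fault path
recording which node touched which page. A conceptual sketch, again
with invented names:

#include <linux/mm.h>		/* page_to_nid() */
#include <linux/sched.h>
#include <linux/topology.h>	/* numa_node_id() */

static void numa_hinting_fault(struct task_struct *p, struct page *page)
{
	int cpu_nid = numa_node_id();		/* node of the faulting CPU */
	int page_nid = page_to_nid(page);	/* node the page sits on */

	/* invented per-task counter, same as in the previous sketch */
	p->numa_pages_on_node[page_nid]++;

	if (page_nid != cpu_nid)
		/* invented helper: hand the page to the knuma_migrated
		 * daemon of the node that is actually touching it */
		queue_for_knuma_migrated(page, cpu_nid);
}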

Here too a huge amount of improvement is possible, and hopefully
it's not too intrusive either.

I also wrote a threaded userland tool that can visually render the
status of the memory at >20 frames per second and show the memory
migration (the tools I found were written in Python, and with >8G of
RAM they just can't deliver that). I was going to try to make it
per-process instead of global before releasing it; that may give
another speedup (or a slowdown, I don't know for sure). It'll help
explain what the code does and let you see it in action, but for us
echo 1 >/sys/kernel/mm/autonuma/debug may be enough. Still, the
visual thing is cool, and if done generically it would be
interesting. Ideally, once it goes per-process, it should also show
which CPU the process is running on, not just where the process
memory is.
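
By the way, a per-process view of placement doesn't strictly need
kernel debug hooks: the stock move_pages() syscall with a NULL
"nodes" argument only queries which node each page sits on, without
moving anything. Here's a minimal standalone example of that
sampling approach (this is not my tool, just the standard interface
such a tool could poll):

/* build with: gcc placement.c -o placement -lnuma */
#include <numaif.h>	/* move_pages() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NPAGES 16

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	void *mem, *pages[NPAGES];
	int status[NPAGES], i;

	if (posix_memalign(&mem, psize, NPAGES * psize))
		return 1;
	for (i = 0; i < NPAGES; i++) {
		pages[i] = (char *)mem + i * psize;
		*(char *)pages[i] = 1;	/* touch so the page is allocated */
	}

	/* nodes == NULL: don't migrate, just fill status[] with the
	 * node each page currently sits on */
	if (move_pages(0 /* self */, NPAGES, pages, NULL, status, 0))
		perror("move_pages");
	for (i = 0; i < NPAGES; i++)
		printf("page %d -> node %d\n", i, status[i]);
	return 0;
}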

arch/x86/include/asm/paravirt.h | 2 -
arch/x86/include/asm/pgtable.h | 51 ++-
arch/x86/include/asm/pgtable_types.h | 22 +-
arch/x86/kernel/cpu/amd.c | 4 +-
arch/x86/kernel/cpu/common.c | 4 +-
arch/x86/kernel/setup_percpu.c | 1 +
arch/x86/mm/gup.c | 2 +-
arch/x86/mm/numa.c | 9 +-
fs/exec.c | 3 +
include/asm-generic/pgtable.h | 13 +
include/linux/autonuma.h | 41 +
include/linux/autonuma_flags.h | 62 ++
include/linux/autonuma_sched.h | 61 ++
include/linux/autonuma_types.h | 54 ++
include/linux/huge_mm.h | 7 +-
include/linux/kthread.h | 1 +
include/linux/mm_types.h | 29 +
include/linux/mmzone.h | 6 +
include/linux/sched.h | 4 +
kernel/exit.c | 1 +
kernel/fork.c | 36 +-
kernel/kthread.c | 23 +
kernel/sched/Makefile | 3 +-
kernel/sched/core.c | 13 +-
kernel/sched/fair.c | 55 ++-
kernel/sched/numa.c | 322 ++++++++
kernel/sched/sched.h | 12 +
mm/Kconfig | 13 +
mm/Makefile | 1 +
mm/autonuma.c | 1465 ++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 32 +-
mm/memcontrol.c | 2 +-
mm/memory.c | 36 +-
mm/mempolicy.c | 15 +-
mm/mmu_context.c | 2 +
mm/page_alloc.c | 19 +
36 files changed, 2376 insertions(+), 50 deletions(-)