Re: [RFC] AutoNUMA alpha6

From: Andrea Arcangeli
Date: Wed Mar 21 2012 - 08:18:06 EST

Next message: Alan Cox: "Re: linux-next: build failure after merge of the tip tree"
Previous message: Ingo Molnar: "Re: [PATCH 3/3] perf, tool: Add new event group management"
In reply to: Ingo Molnar: "Re: [RFC] AutoNUMA alpha6"
Next in thread: Avi Kivity: "Re: [RFC][PATCH 00/26] sched/numa"

On Wed, Mar 21, 2012 at 08:53:49AM +0100, Ingo Molnar wrote:
> My impression is that while threading is on the rise due to its
> ease of use, many threaded HPC workloads still fall into the
> second category.

This is why, after Peter's initial complaints that a threaded
application had to be handled perfectly by AutoNUMA even if it had
more threads than CPUs in a node, I had to take a break and rewrite
part of AutoNUMA to handle this scenario automatically, by introducing
the NUMA hinting page faults. Before Peter's complaints I only had the
pagetable scanner. So I appreciate his criticism for having convinced
me that AutoNUMA had to have this working immediately.

Perhaps somebody remembers what I said on stage at KVMForum about
this: back then I was planning to automatically handle only processes
that fit in a node. So the discussion with Peter was fundamental in
adding one more gear to the design; without it I wouldn't be able to
compete with his syscalls.

> In fact they are often explicitly *turned* into the second
> category at the application level by duplicating shared global
> data explicitly and turning it into per thread local data.

Per-thread local data is the best case for AutoNUMA. AutoNUMA already
detects and reacts to false sharing by putting all false-sharing
threads in the same node, statistically, over time. It also cancels
the pages already queued for migration, and requires two more
consecutive hits from threads in the same node before re-allowing
migration. I did quite a bit of work to get false sharing handled
properly. But the absolute best case is per-thread local storage (both
numa01 -DTHREAD_ALLOC and numa02: numa02 spans the whole system with a
single process, while numa01 has two processes, each of which fits in
a node, with thread-local storage).
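
To make the "cancel, then require two more consecutive hits" rule
concrete, here is a minimal userspace sketch. It is not the AutoNUMA
code: struct page_hint, page_hint_init() and page_hint_fault() are
hypothetical names made up for illustration. It also assumes that a
page which has never seen a conflicting fault is queued for migration
right away, and that the conflicting fault itself does not count as
one of the two hits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state; NOT the real AutoNUMA data structures. */
#define HITS_TO_REARM 2   /* "two more consecutive hits" from the text */
#define NO_NODE (-1)

struct page_hint {
    int  last_nid;        /* node of the last thread that faulted on the page */
    int  streak;          /* consecutive same-node faults since the last conflict */
    bool migrate_queued;  /* is a migration of this page pending? */
    bool false_shared;    /* has a conflicting fault been seen and not yet cleared? */
};

static void page_hint_init(struct page_hint *p)
{
    p->last_nid = NO_NODE;
    p->streak = 0;
    p->migrate_queued = false;
    p->false_shared = false;
}

/*
 * Record a hinting fault from a thread running on node `nid`.
 * Returns true if the page is queued for migration after this fault.
 */
static bool page_hint_fault(struct page_hint *p, int nid)
{
    assert(nid >= 0);

    if (p->last_nid != NO_NODE && p->last_nid != nid) {
        /* A different node touched the page: treat it as false sharing. */
        p->false_shared = true;
        p->migrate_queued = false;  /* cancel the pending migration */
        p->last_nid = nid;
        p->streak = 0;              /* assumption: this fault is not a "hit" */
        return false;
    }

    p->last_nid = nid;
    if (!p->false_shared) {
        /* Assumption: with no conflict on record, queue immediately. */
        p->migrate_queued = true;
        return true;
    }

    /* Re-allow migration only after enough consecutive same-node hits. */
    if (++p->streak >= HITS_TO_REARM) {
        p->false_shared = false;
        p->streak = 0;
        p->migrate_queued = true;
    }
    return p->migrate_queued;
}
```

In this model a fault sequence from nodes 0, 1, 1, 1 queues the page
on the first fault, cancels on the second, stays cancelled on the
third and re-queues on the fourth. Any fault from another node in
between resets the count.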

> And to default-enable any of this on stock kernels we'd need
> even more testing and widespread, feel-good speedups in almost
> every key Linux workload... I don't see that happening though,
> so the best we can get are probably some easy and flexible knobs
> for HPC.

This is a very good point. We can merge AutoNUMA in a disabled state.
It won't ever do anything unless explicitly enabled and, even more
importantly, if you disable it (echo 0 >enabled) it will deactivate
completely and everything will settle down as if it had never run; it
will leave no trace in the VM and scheduler.

There are three gears: if the pagetable scanner (first gear) never
runs, the other gears never activate and it is a complete bypass (noop).

There are environments like virt that are quite memory static and
predictable, so if it were demonstrated to work for them, it would be
really easy for a virt admin to echo 1 >/sys/kernel/mm/autonuma/enabled .
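
For management software that would rather flip the knob
programmatically than shell out to echo, a rough sketch follows. The
sysfs path is the one quoted above and only exists on a kernel
carrying the AutoNUMA patches. write_knob() and autonuma_set_enabled()
are hypothetical helper names, not part of any existing library.

```c
#include <assert.h>
#include <stdio.h>

/* Path quoted in the message; present only with the AutoNUMA patches applied. */
#define AUTONUMA_ENABLED "/sys/kernel/mm/autonuma/enabled"

/*
 * Write "1" or "0" to a sysfs-style knob, like `echo 1 >path`.
 * Returns 0 on success, -1 on failure (e.g. file missing or no permission).
 */
static int write_knob(const char *path, int on)
{
    FILE *f;
    int ok;

    assert(path != NULL);

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fputs(on ? "1\n" : "0\n", f) >= 0;
    /* Write errors on buffered streams may only surface at fclose(). */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}

/* Hypothetical convenience wrapper around the knob named in the message. */
static int autonuma_set_enabled(int on)
{
    return write_knob(AUTONUMA_ENABLED, on);
}
```

Calling autonuma_set_enabled(1) would then correspond to the echo 1
above and autonuma_set_enabled(0) to echo 0 >enabled. On a kernel
without the patches the call simply returns -1.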
