Re: [PATCH 5/8] sched, numa, mm: Add adaptive NUMA affinity support

From: Rik van Riel
Date: Fri Nov 16 2012 - 13:23:14 EST

In reply to: Ingo Molnar: "Re: [PATCH 5/8] sched, numa, mm: Add adaptive NUMA affinity support"

On 11/16/2012 01:14 PM, Ingo Molnar wrote:
>
> * Rik van Riel <riel@xxxxxxxxxx> wrote:
>
>> On 11/12/2012 11:04 AM, Peter Zijlstra wrote:
>>
>>> We change the load-balancer to prefer moving tasks in order of:
>>>
>>> 1) !numa tasks and numa tasks in the direction of more faults
>>> 2) allow !ideal tasks getting worse in the direction of faults
>>> 3) allow private tasks to get worse
>>> 4) allow shared tasks to get worse
>>>
>>> This order ensures we prefer increasing memory locality but when
>>> we do have to make hard decisions we prefer spreading private
>>> over shared, because spreading shared tasks significantly
>>> increases the interconnect bandwidth since not all memory can
>>> follow.
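
For readers following the thread, the four-step preference order quoted
above can be sketched roughly as follows. This is only an illustration,
not the code from the patch: the Task fields, migration_tier() and
pick_order() are hypothetical names, and the reading of "ideal" (the task
currently runs on its preferred node) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Task:
    # All fields are hypothetical stand-ins for per-task NUMA statistics.
    name: str
    is_numa: bool = False              # task has NUMA fault statistics
    shared: bool = False               # classified as shared (vs private)
    ideal_node: Optional[int] = None   # node the task would prefer to run on
    faults: Dict[int, int] = field(default_factory=dict)  # faults per node


def migration_tier(task: Task, src: int, dst: int) -> int:
    """Return 1..4; a lower tier is preferred when moving src -> dst."""
    # 1) !numa tasks, and numa tasks moving toward more faults.
    if not task.is_numa:
        return 1
    if task.faults.get(dst, 0) > task.faults.get(src, 0):
        return 1
    # 2) Tasks that are not on their ideal node anyway may get worse.
    if task.ideal_node != src:
        return 2
    # 3) Private tasks may get worse before 4) shared tasks do.
    return 4 if task.shared else 3


def pick_order(tasks: List[Task], src: int, dst: int) -> List[Task]:
    # sorted() is stable, so tasks within one tier keep their queue order.
    return sorted(tasks, key=lambda t: migration_tier(t, src, dst))
```

With this ordering a shared task sitting on its ideal node is the last
candidate the balancer would pull, which matches the stated goal of
spreading private tasks before shared ones.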

>> Combined with the fact that we only turn a certain amount of
>> memory into NUMA ptes each second, could this result in a
>> program being classified as a private task one second, and a
>> shared task a few seconds later?
>
> It's a statistical method, like most of scheduling.
>
> It's as prone to oscillation as tasks are already prone to being
> moved spuriously by the load balancer today, due to the per CPU
> load average being statistical and them being slightly above or
> below a critical load average value.
>
> Higher freq oscillation should not happen normally though, we
> dampen these metrics and have per CPU hysteresis.
>
> ( We can also add explicit hysteresis if anyone demonstrates
> real oscillation with a real workload - wanted to keep it
> simple first and change it only as-needed. )
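
To make the damping and "explicit hysteresis" idea above concrete, here
is a small sketch. It is not taken from the patch series: the class name,
the smoothing factor and the two thresholds are assumptions chosen purely
for illustration. A per-task "shared fault ratio" is smoothed with an
exponential moving average, and the private/shared label only flips when
the average leaves a dead band between two thresholds.

```python
class SharedClassifier:
    """Hypothetical private/shared classifier with damping and hysteresis."""

    def __init__(self, alpha=0.25, enter_shared=0.6, leave_shared=0.4):
        self.alpha = alpha                # EMA weight of the newest sample
        self.enter_shared = enter_shared  # go private -> shared above this
        self.leave_shared = leave_shared  # go shared -> private below this
        self.avg = 0.0                    # damped shared-fault ratio
        self.shared = False               # current classification

    def update(self, sample):
        # Damping: move the average only part of the way to the new sample.
        self.avg += self.alpha * (sample - self.avg)
        # Hysteresis: the threshold that applies depends on the current
        # label, so values inside the dead band never cause a flip.
        if self.shared:
            if self.avg < self.leave_shared:
                self.shared = False
        elif self.avg > self.enter_shared:
            self.shared = True
        return self.shared
```

A task whose samples alternate between 0.45 and 0.55 would flip on every
sample under a single 0.5 cutoff applied to raw samples; with the sketch
above it keeps whichever label it already had.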

This heuristic is by no means simple, and there still is no
explanation for the serious performance degradations that
were seen on a 4 node system running specjbb in 4 node-sized
JVMs.

I asked a number of questions on this patch yesterday, and
am hoping to get explanations at some point :)

--
All rights reversed
