Re: [PATCH 0/12] cleanup __build_sched_domains()

From: Andreas Herrmann
Date: Tue Aug 18 2009 - 09:17:08 EST


On Tue, Aug 18, 2009 at 01:16:44PM +0200, Ingo Molnar wrote:
>
> * Andreas Herrmann <andreas.herrmann3@xxxxxxx> wrote:
>
> > Hi,
> >
> > Following patches try to make __build_sched_domains() less ugly
> > and more readable. They shouldn't be harmful. Thus I think they
> > can be applied for .32.
> >
> > Patches are against tip/master as of today.
> >
> > FYI, I need those patches as a base for introducing a new domain
> > level for multi-node CPUs for which I intend to sent patches as
> > RFC asap.
>
> Very nice cleanups!
>
> Magny-Cours indeed will need one more sched-domains level,
> something like:
>
> [smt thread]
> core
> internal numa node
> cpu socket
> external numa node

My current approach is to have the NUMA node domain either below the CPU
domain (in case of a multi-node CPU where SRAT describes each internal
node as a NUMA node) or, as today, as the top-level domain (e.g. in case
of node interleaving or missing/broken ACPI SRAT detection).

Sched domain levels (note SMT==SIBLING, NODE==NUMA), listed from lowest
to highest, for the two cases:

(1) groups in the NUMA domain are subsets of groups in the CPU domain
(2) groups in the NUMA domain are supersets of groups in the CPU domain

    (1)      |    (2)
-------------|-------------------
    SMT      |    SMT
    MC       |    MC
    MN (new) |    MN
    NUMA     |    CPU
    CPU      |    NUMA
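
For illustration, here is a tiny stand-alone user-space sketch of the two
resulting hierarchies (not the actual patches; the level names are plain
strings, and only the parent/child linking is meant to mirror how adjacent
struct sched_domain levels point at each other):

#include <stdio.h>
#include <stddef.h>

struct sched_domain {
	const char *name;
	struct sched_domain *parent;	/* next wider level */
	struct sched_domain *child;	/* next narrower level */
};

/* Link an array of levels, ordered lowest to highest, via parent/child. */
static void chain_levels(struct sched_domain *levels, int n)
{
	for (int i = 0; i < n; i++) {
		levels[i].child  = (i > 0)     ? &levels[i - 1] : NULL;
		levels[i].parent = (i < n - 1) ? &levels[i + 1] : NULL;
	}
}

static void print_chain(struct sched_domain *sd)
{
	for (; sd; sd = sd->parent)
		printf("%s%s", sd->name, sd->parent ? " -> " : "\n");
}

int main(void)
{
	/* (1) SRAT describes each internal node of the package as a NUMA node */
	struct sched_domain case1[] = {
		{ "SMT" }, { "MC" }, { "MN" }, { "NUMA" }, { "CPU" }
	};
	/* (2) node interleaving or missing/broken SRAT */
	struct sched_domain case2[] = {
		{ "SMT" }, { "MC" }, { "MN" }, { "CPU" }, { "NUMA" }
	};

	chain_levels(case1, 5);
	chain_levels(case2, 5);
	print_chain(case1);	/* SMT -> MC -> MN -> NUMA -> CPU */
	print_chain(case2);	/* SMT -> MC -> MN -> CPU -> NUMA */
	return 0;
}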

I'll also introduce a new parameter, sched_mn_power_savings, which causes
tasks to be scheduled on one socket until its capacity is reached; only
once that capacity is reached will other sockets be occupied as well.
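
As a rough, stand-alone illustration of that packing behaviour (user-space
C, not scheduler code; the socket counts and capacities are made-up
numbers chosen only for the example):

#include <stdio.h>

struct socket_stat {
	int nr_running;	/* tasks currently on this socket */
	int capacity;	/* e.g. number of cores in the socket */
};

/* With power savings enabled, fill the first socket that still has room;
 * otherwise spread to the least loaded socket. */
static int pick_socket(const struct socket_stat *s, int nr_sockets,
		       int power_savings)
{
	if (power_savings) {
		for (int i = 0; i < nr_sockets; i++)
			if (s[i].nr_running < s[i].capacity)
				return i;
	}

	int best = 0;
	for (int i = 1; i < nr_sockets; i++)
		if (s[i].nr_running < s[best].nr_running)
			best = i;
	return best;
}

int main(void)
{
	struct socket_stat sockets[2] = { { 4, 12 }, { 1, 12 } };

	printf("power policy -> socket %d\n", pick_socket(sockets, 2, 1));
	printf("perf policy  -> socket %d\n", pick_socket(sockets, 2, 0));
	return 0;
}

The existing sched_smt_power_savings and sched_mc_power_savings knobs are
exposed under /sys/devices/system/cpu/, and sched_mn_power_savings would
presumably follow the same scheme.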

> ... which is certainly interesting, especially since the hierarchy
> possibly 'crosses', i.e. we might have the two internal numa nodes
> share a L2 or L3 cache, right?

> I'd also not be surprised if the load-balancer needed some care to
> properly handle such a setup.

It needs some care and gave me some headaches to get it working in all
cases (i.e. NUMA, no-NUMA, NUMA-but-no-SRAT). My current code (which
still needs to be split into proper patches for submission) works fine
in all but one case, and I am still debugging that one.

The case that is not working is a normal (non-multi-node) NUMA system,
on which switching to the power-savings policy does not take effect for
already running tasks. Only newly created tasks are scheduled according
to the power policy.

> It's all welcome work in any case, and for .32.


Thanks,

Andreas

--
Operating | Advanced Micro Devices GmbH
System | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
Center | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
(OSRC) | Registergericht München, HRB Nr. 43632

