Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

From: Matthew Dobson
Date: Fri Oct 08 2004 - 18:52:38 EST


On Thu, 2004-10-07 at 01:51, Paul Jackson wrote:
> > I don't see what non-exclusive cpusets buys us.
>
> One can nest them, overlap them, and duplicate them ;)

<snip example>

> We now have several nested cpusets, each overlapping its ancestors,
> with tasks in each cpuset.
>
> But only the top hpcarena cpuset has the exclusive ownership
> with no form of overlap of everything in its subtree that
> something like a distinct scheduler domain wants.
>
> Hopefully the above is not what you meant by "little more than a
> convenient way to group tasks."

I think this example is easily achievable with the sched_domains
modifications I am proposing. You can still create your 128-CPU
exclusive domain, called big_domain (due to my lack of naming
creativity), and further divide big_domain into smaller, non-exclusive
sched_domains. We do this all the time, albeit statically at boot time,
with the current sched_domains code: on IA64 we create a 4-node domain,
and underneath it we create four 1-node domains. We've now partitioned
the system into 4 sched_domains, each containing 4 cpus. Balancing
between these 4 node-level sched_domains is allowed, but can be
disallowed by not setting the SD_LOAD_BALANCE flag on the domain that
spans them. Your example does show that non-exclusive cpusets can be
more than just a convenient way to group tasks, but it can also be
achieved with what I'm proposing.
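Roughly, the shape I have in mind looks like the sketch below. This is
illustrative only, not code from the patch: the field names (span,
parent, flags) follow the 2.6-era struct sched_domain, but the helper
name, the static arrays, and the 4x4 sizing are made up to mirror the
IA64 example above.

/*
 * Illustrative sketch only -- not from the actual patch.  Field names
 * (span, parent, flags) follow the 2.6-era struct sched_domain; the
 * helper name and static arrays are invented for this example.
 */
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/cpumask.h>

#define EX_NODES		4	/* the 4-node IA64 example */
#define EX_CPUS_PER_NODE	4

static struct sched_domain ex_top_domain;		/* spans all 16 cpus */
static struct sched_domain ex_node_domain[EX_NODES];	/* one per node */

static void __init ex_build_domains(void)
{
	int node, cpu;

	cpus_clear(ex_top_domain.span);
	for (cpu = 0; cpu < EX_NODES * EX_CPUS_PER_NODE; cpu++)
		cpu_set(cpu, ex_top_domain.span);
	ex_top_domain.parent = NULL;
	/*
	 * SD_LOAD_BALANCE on this spanning domain is what permits
	 * balancing *between* the node-level domains below; leaving it
	 * clear isolates the nodes from each other.
	 */
	ex_top_domain.flags = SD_LOAD_BALANCE;

	for (node = 0; node < EX_NODES; node++) {
		struct sched_domain *sd = &ex_node_domain[node];

		cpus_clear(sd->span);
		for (cpu = 0; cpu < EX_CPUS_PER_NODE; cpu++)
			cpu_set(node * EX_CPUS_PER_NODE + cpu, sd->span);
		sd->parent = &ex_top_domain;
		sd->flags = SD_LOAD_BALANCE;	/* balance freely within a node */
	}
}

The real domains would of course be built per-cpu and attached through
the existing arch setup code; the point is just that the exclusive-outer,
non-exclusive-inner layout above is the same shape as your cpuset
example.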


> > 2) rewrite the scheduler/allocator to deal with these bindings up front,
> > and take them into consideration early in the scheduling/allocating
> > process.
>
> The allocator is less stressed here by varied mems_allowed settings
> than is the scheduler. For in 99+% of the cases, the allocator is
> dealing with a zonelist that has the local (currently executing) node
> first on the zonelist, and is dealing with a mems_allowed that allows
> allocation on the local node. So the allocator almost always succeeds
> the first time it goes to see if the candidate page it has in hand
> comes from a node allowed in current->mems_allowed.

Very true. The allocator and scheduler are very different beasts, just
as memory and CPUs are. The allocator does not struggle to cope with
mems_allowed (at least currently) as much as the scheduler struggles to
cope with cpus_allowed.
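To spell out why the allocator has it easier, its fast path reduces to
something like the sketch below. Again, this is illustrative and not the
cpusets code itself: the function name is invented and the exact form of
the node lookup and mems_allowed test are stand-ins, but the shape (local
zone first, one nodemask check that almost always passes) is the point.

/*
 * Illustrative sketch of the allocator-side check described above --
 * not the real cpusets hook.  The function name is hypothetical; the
 * node lookup and mems_allowed test are stand-ins.
 */
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/sched.h>

static struct page *ex_alloc_from_zonelist(struct zonelist *zl,
					   unsigned int gfp_mask,
					   unsigned int order)
{
	struct zone *zone;
	int i;

	/* zonelist is NULL-terminated, with the local node's zones first */
	for (i = 0; (zone = zl->zones[i]) != NULL; i++) {
		int node = zone->zone_pgdat->node_id;

		/*
		 * The local node is first in the zonelist and is almost
		 * always in current->mems_allowed, so in 99+% of calls
		 * this test passes on the very first zone.
		 */
		if (!node_isset(node, current->mems_allowed))
			continue;

		/* ... try to allocate 2^order pages from this zone ... */
	}
	return NULL;
}

The scheduler has no equivalent "first candidate is almost always right"
shortcut once cpus_allowed masks start cutting across the domain
hierarchy, which is exactly where the struggle shows up.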

-Matt
