Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

From: Paul Jackson
Date: Thu Oct 07 2004 - 04:00:03 EST


> I don't see what non-exclusive cpusets buys us.

One can nest them, overlap them, and duplicate them ;)

For example, we could do the following (a rough sketch of setting
this hierarchy up through the cpuset filesystem follows the list):

* Carve off CPUs 128-255 of a 256 CPU system in which
to run various HPC jobs, each requiring some number of CPUs.
This is named /dev/cpuset/hpcarena, and it is the really
exclusive and isolated sort of cpuset which can and does
have its own scheduler domain, with a scheduler configuration
tuned for running a mix of HPC jobs. In this hpcarena also
run the per-cpu kernel threads that are pinned on CPUs
128-255 (for _all_ tasks running on the CPUs of an exclusive
cpuset must be in that cpuset or one below it).

* The testing group gets half of this cpuset each weekend, in
order to run a battery of tests: /dev/cpuset/hpcarena/testing.
The batch manager described next runs in this testing cpuset.

* They run a home-brew batch manager, which takes an input
stream of test cases, carves off a small cpuset of the
requested size for each, and runs that test case in it.
This results in cpusets with names like
/dev/cpuset/hpcarena/testing/test123. Our test123 is
running in this cpuset.

* Test123 here happens to be a test of the integrity of cpusets,
so it sets up a couple of cpusets in which to run two independent
jobs, each a 2 CPU MPI job. This results in the cpusets
/dev/cpuset/hpcarena/testing/test123/a and
/dev/cpuset/hpcarena/testing/test123/b. Our little
MPI jobs 'a' and 'b' are running in these two cpusets.

We now have several nested cpusets, each overlapping its ancestors,
with tasks in each cpuset.
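
To make that concrete, here is a minimal user space sketch of how
such a hierarchy might be set up, assuming the cpuset filesystem is
mounted at /dev/cpuset and exposes per-cpuset files named cpus,
mems, cpu_exclusive and tasks as in this patch.  The particular CPU
ranges, memory node numbers and the pid written to tasks are made-up
values for illustration only:

/*
 * Minimal sketch, not part of the patch: build the example hierarchy
 * through the cpuset filesystem.  Assumes /dev/cpuset is mounted and
 * exposes per-cpuset files named "cpus", "mems", "cpu_exclusive" and
 * "tasks".  Error handling is trimmed for brevity.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a value into one control file of a cpuset directory. */
static void cpuset_write(const char *cpuset, const char *file, const char *val)
{
	char path[256];
	int fd;

	snprintf(path, sizeof(path), "%s/%s", cpuset, file);
	fd = open(path, O_WRONLY);
	if (fd >= 0) {
		write(fd, val, strlen(val));
		close(fd);
	}
}

int main(void)
{
	/* Carve off CPUs 128-255 as the exclusive hpcarena cpuset. */
	mkdir("/dev/cpuset/hpcarena", 0755);
	cpuset_write("/dev/cpuset/hpcarena", "cpus", "128-255");
	cpuset_write("/dev/cpuset/hpcarena", "mems", "8-15");	/* made-up node range */
	cpuset_write("/dev/cpuset/hpcarena", "cpu_exclusive", "1");

	/* Give testing half of hpcarena for the weekend (overlaps its parent). */
	mkdir("/dev/cpuset/hpcarena/testing", 0755);
	cpuset_write("/dev/cpuset/hpcarena/testing", "cpus", "192-255");
	cpuset_write("/dev/cpuset/hpcarena/testing", "mems", "12-15");

	/* The batch manager carves off test123 and attaches the test's pid. */
	mkdir("/dev/cpuset/hpcarena/testing/test123", 0755);
	cpuset_write("/dev/cpuset/hpcarena/testing/test123", "cpus", "192-195");
	cpuset_write("/dev/cpuset/hpcarena/testing/test123", "mems", "12");
	cpuset_write("/dev/cpuset/hpcarena/testing/test123", "tasks", "1234");

	/* test123 would create its children a and b the same way. */
	return 0;
}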

But only the top hpcarena cpuset has the exclusive,
overlap-free ownership of everything in its subtree that
something like a distinct scheduler domain wants.

Hopefully the above is not what you meant by "little more than a
convenient way to group tasks."


> 2) rewrite the scheduler/allocator to deal with these bindings up front,
> and take them into consideration early in the scheduling/allocating
> process.

The allocator is less stressed here by varied mems_allowed settings
than is the scheduler. In 99+% of the cases, the allocator is
dealing with a zonelist that has the local (currently executing)
node first, and with a mems_allowed that allows allocation on the
local node. So the allocator almost always succeeds the first time
it checks whether the candidate page it has in hand comes from a
node allowed in current->mems_allowed.
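
In kernel terms that fast path amounts to one node mask test per
zone.  The sketch below is not the literal patch code: the real
check sits behind a small helper, and buffered_rmqueue() here merely
stands in for the existing per-zone allocation step, so treat that
call and the exact types as assumptions:

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/nodemask.h>
#include <linux/sched.h>

/*
 * Sketch of the allocator fast path described above.  The only cost
 * cpusets add per zone is a node mask test against
 * current->mems_allowed; since the local node comes first in the
 * zonelist and is nearly always allowed, the test passes on the
 * first zone.
 */
static struct page *alloc_pages_sketch(unsigned int gfp_mask,
				       unsigned int order,
				       struct zonelist *zonelist)
{
	struct zone **z;
	struct page *page;

	for (z = zonelist->zones; *z != NULL; z++) {
		/* Skip zones on nodes this task may not allocate from. */
		if (!node_isset((*z)->zone_pgdat->node_id,
				current->mems_allowed))
			continue;

		page = buffered_rmqueue(*z, order, gfp_mask);
		if (page)
			return page;
	}
	return NULL;
}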

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.650.933.1373