Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

From: Paul Jackson
Date: Tue Oct 05 2004 - 19:36:50 EST


Martin wrote:
> Nope - personally I see us more headed for the exclusive cpusets, and
> handle the non-exclusive stuff via a more CKRM-style mechanism.

As Simon said, no go.

Martin:

1) Are you going to prevent sched_setaffinity calls as well?
What about the per-cpu kernel threads?

See my reply to Hubertus on this thread:

Date: Tue, 5 Oct 2004 07:20:48 -0700
Message-Id: <20041005072048.16632106.pj@xxxxxxx>

2) Do you have agreement from the LSF and PBS folks that they
can port to systems that support "shares" (the old Cray
term roughly equivalent to CKRM), but lacking placement
for jobs using shared resources? I doubt it.

3) Do you understand that OpenMP and MPI applications can really
need placement, in order to get separate threads on separate
CPUs, to allow concurrent execution, even when they aren't using
(or worth providing) 100% of each CPU they are on.

4) Continuing on item (1), I think that CKRM is going to have to
deal with varying, detailed placement constraints, such as is
presently implemented using a variety of settings of cpus_allowed
and mems_allowed. So too will schedulers and allocators. We can
setup a few, high level domains, that correspond to entire cpuset
subtrees, that have closer to the exclusive properties that
you want (stronger than the current cpuset exclusive flag ensures).
But within any of those domains, we need a mix of exclusive
and non-exclusive placement.

The CKRM controlled shares style of mechanism is appropriate when
one CPU cycle is as good as another, and one just needs to manage
what share of the total capacity a given class of users receive.

There are other applications, such as OpenMP and MPI applications with
closely coupled parallel threads, that require placement, including in
setups where that application doesn't get a totally isolated exclusive
'soft' partition of its own. If an OpenMP or MPI job doesn't have each
separate thread placed on a distinct CPU, it runs like crud. This is
so whether the job has its own dedicated cpuset, or it is sharing CPUs.

And there are important system management products, such as OpenPBS and
LSF, which rely on placement of jobs in named sets of CPUs and Memory
Nodes, both for jobs that are closely coupled parallel and jobs that are
not, both for jobs that have exclusive use of the CPUs and Memory Nodes
assigned to them and not.

CKRM cannot make these other usage patterns and requirements go away,
and even if it could force cpusets to only come in the totally isolated
flavor, CKRM would still have to deal with the placement that occurs
on a thread-by-thread basis that is essential to the performance of
tightly coupled thread applications and essential to the basic function
of certain per-cpu kernel threads.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.650.933.1373
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/