Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

From: Paul Jackson
Date: Sun Oct 03 2004 - 19:49:28 EST


Martin wrote:
>
> Mmmm. The fundamental problem I think we ran across (just whilst pondering,
> not in code) was that some things (eg ... init) are bound to ALL cpus (or
> no cpus, depending how you word it); i.e. they're created before the cpusets
> are, and are a member of the grand-top-level-uber-master-thingummy.
>
> How do you service such processes? That's what I meant by the exclusive
> domains aren't really exclusive.

I move 'em. I have user code that identifies the kernel threads whose
cpus_allowed is a superset of cpus_online_map, and I put them in a nice
little padded cell with init and the classic Unix daemons, called the
'bootcpuset'.

The tasks whose cpus_allowed is a strict _subset_ of cpus_online_map
need to be where they are. These are things like the migration helper
threads, one for each cpu. They get a license to violate cpuset
boundaries.

I will probably end up submitting a patch at some point, that changes
two lines, one in ____call_usermodehelper() and one in kthread(), from
setting the cpus_allowed on certain kernel threads to CPU_MASK_ALL, so
that instead these lines set that cpus_allowed to a new mask, a kernel
global variable that can be read and written via the cpuset api. But
other than that, I don't need anymore kernel hooks than I already have,
and even now, I can get everything that's causing me any grief pinned
into the bootcpuset.


> But that's the problem ... I think there are *always* cpusets that overlap.
> Which is sad (fixable?) because it breaks lots of intelligent things we
> could do.

So with my bootcpuset, the problem is reduced, to a few tasks per CPU,
such as the migration threads, which must remain pinned on their one CPU
(or perhaps on just the CPUs local to one Memory Node). These tasks
remain in the root cpuset, which by the scheme we're contemplating,
doesn't get a sched_domain in the fancier configurations.

Yup - you're right - these tasks will also want the scheduler to give
them CPU time when they need it. Hmmm ... logically this violates our
nice schemes, but seems that we are down to such a small exception case
that there must be some primitive way to workaround this.

We basically need to keep a list of the 4 or 5 per-cpu kernel threads,
and whenever we repartition the sched_domains, make sure that each such
kernel thread is bound to whatever sched_domain happens to be covering
that cpu. If we just wrote the code, and quit trying to find a grand
unifying theory to explain it consistently with the rest of our design,
it would probably work just fine.

The cpuset code would have to be careful, when it came time to list the
tasks attached to a cpuset (Workload Manager software is fond of this
call) _not_ to list these indigenous (not "migrant" !) worker threads.
And when listing the tasks in the root cpuset, _do_ include them.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.650.933.1373
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/