Re: cpusets not cpu hotplug aware

From: Paul Jackson
Date: Mon Aug 21 2006 - 16:59:46 EST

Anton wrote:

> Maybe the notifier is the right way to go, but it seems strange to
> create two copies of cpu_online_map (with the associated possibiliy of
> the two getting out of sync).

Every cpuset in the system, of which this top_cpuset is just the first,
has a set of cpus and memory nodes on which tasks in that cpuset are
allowed to operate. It's not just top_cpuset that we need to understand
how to relate to hotplug and unplug.

I'll bet there is more hidden state in mm/mempolicy.c, for mbind()
and set_mempolicy(), and in kernel/sched.c for the sched_setaffinity(),
which was derived from what memory nodes or cpus were online. For
example, I see several fields in 'struct mempolicy' that contain
node numbers in some form, and the 'cpus_allowed' field in the task
struct that sched_setaffinity sets.

How does hotplug and unplug interact with these various saved states?

> Its up to the cpusets code to register a hotplug notifier to update the
> top_cpuset maps.

That, or user level code, when it adds or removes a cpu or a memory
node, needs to be responsible for adding or removing that cpu or node
to or from whichever cpusets are affected.

For example, if you just added cpu 31, to a system that had been
running on cpus 0 to 30, you can add cpu 31 to the top cpuset by

mkdir /dev/cpuset # if not already done
mount -t cpuset cpuset /dev/cpuset # if not already done
/bin/echo 0-31 > /dev/cpsuet/cpus

> If cpuset_cpus_allowed doesnt return the current online mask and we want
> to schedule on a cpu that has been added since boot it looks like we
> will fail.

In general, on systems actually using cpusets, that -is- what should
happen. Just because a cpu was brought online doesn't mean it was
intended to be allowed in any given tasks current cpuset.

Granted, I would guess users of systems not using cpusets (but
still have cpusets configured in their kernel, as is common in some
distro kernels), would expect the behaviour you expected - bringing
a cpu (or memory node) on or offline would make it available (or
not) for something like a sched_setaffinity (or mbind/set_mempolicy)
immediately, without having to invoke some magic cpuset voodoo.

Offhand, this sounds to me like a choice of two modes of operation.

If you aren't actually using cpusets (so every task is in the
original top_cpuset) then you'd expect that cpuset to "get out
of the way", meaning top_cpuset (the only cpuset, in this case)
tracked whatever cpus and nodes were online at the moment.

If instead you start messing with cpusets (creating more than one
of them and moving tasks between them) then you'd expect cpusets
to be enforced, without automatically adding newly added cpus or
memory nodes to existing cpusets. Only the user knows which
cpusets should get the added cpus or memory nodes in this case.

I don't jump for joy over yet another modal state flag. But I don't see
a better alternative -- do you?

I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.925.600.0401
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at