Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition

From: Waiman Long
Date: Fri Apr 14 2023 - 15:07:25 EST


On 4/14/23 13:38, Waiman Long wrote:
On 4/14/23 13:34, Tejun Heo wrote:
On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote:
On 4/14/23 12:54, Tejun Heo wrote:
On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
I now have a slightly different idea of how to do that. We already have an
internal cpumask for partitioning - subparts_cpus. I am thinking about
exposing it as cpuset.cpus.reserve. The current way of creating
subpartitions will be called automatic reservation and require a direct
parent/child partition relationship. But as soon as a user writes anything to
it, it will break automatic reservation and require manual reservation going
forward.

In that way, we can keep the old behavior, but also support new use cases. I
am going to work on that.
I'm not sure I fully understand the proposed behavior but it does sound more
quirky.
The idea is to use the existing subparts_cpus for cpu reservation instead of
adding a new cpumask for that purpose. The current way of partition creation
does CPU reservation (setting subparts_cpus) automatically with the
constraint that the parent of a partition must be a partition root itself.
One way to relax this constraint is to allow a new manual reservation mode
where users can set reserved CPUs manually and distribute them down the
hierarchy before activating a partition to use those cpus.

Now the question is how to enable this new manual reservation mode. One way
to do it is to enable it whenever the new cpuset.cpus.reserve file is
modified. Alternatively, we may enable it by a cgroupfs mount option or a
boot command line option.
It'd probably be best if we can keep the behavior within cgroupfs if
possible. Would you mind writing up the documentation section describing the
behavior beforehand? I think things would be clearer if we look at it from
the interface documentation side.

Sure, will do that. I need some time and so it will be early next week.

Just kidding :-)

Below is a draft of the new cpuset.cpus.reserve cgroupfs file:

  cpuset.cpus.reserve
        A read-write multiple values file which exists on all
        cpuset-enabled cgroups.

        It lists the reserved CPUs to be used for the creation of
        child partitions.  See the section on "cpuset.cpus.partition"
        below for more information on cpuset partitions.  These reserved
        CPUs should be a subset of "cpuset.cpus" and will be mutually
        exclusive of "cpuset.cpus.effective" when used since these
        reserved CPUs cannot be used by tasks in the current cgroup.
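
The two constraints in the paragraph above can be sketched in Python. This
is only an illustration, not kernel code: the helper names
parse_cpu_list() and reserve_is_consistent() are hypothetical, and the
file contents are assumed to use the usual comma-separated range format
(e.g. "0-3,8").

```python
def parse_cpu_list(text: str) -> set:
    """Parse a CPU list string such as "0-3,8" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a stray comma
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def reserve_is_consistent(cpus: str, reserve: str, effective: str) -> bool:
    """Check the two constraints stated in the draft text (hypothetical helper).

    - the reserved CPUs are a subset of cpuset.cpus
    - the reserved CPUs do not overlap cpuset.cpus.effective
    """
    c = parse_cpu_list(cpus)
    r = parse_cpu_list(reserve)
    e = parse_cpu_list(effective)
    return r <= c and r.isdisjoint(e)
```

For instance, reserve_is_consistent("0-7", "4-7", "0-3") holds, while a
reserve of "4-8" fails the subset check against a "cpuset.cpus" of "0-7".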

        There are two modes for partition CPU reservation: auto and
        manual.  The system starts up in auto mode where
        "cpuset.cpus.reserve" will be set automatically when valid
        child partitions are created and users don't need to touch the
        file at all.  This mode has the limitation that the parent of a
        partition must be a partition root itself.  So child partitions
        have to be created one by one from the cgroup root down.

        To enable the creation of a partition down in the hierarchy
        without requiring the intermediate cgroups to be partition
        roots, one has to turn on the manual reservation mode by
        writing directly to "cpuset.cpus.reserve" with a value
        different from its current value.  By distributing the reserved
        CPUs down the cgroup hierarchy to the parent of the target
        cgroup, this target cgroup can be switched to become a
        partition root if its "cpuset.cpus" is a subset of the set of
        valid reserved CPUs in its parent.  The set of valid reserved
        CPUs consists of those CPUs that are present in the
        "cpuset.cpus.reserve" of all its ancestors up to the cgroup
        root and that have not been allocated to another valid
        partition yet.

        Once manual reservation mode is enabled, a cgroup administrator
        must always set up "cpuset.cpus.reserve" files properly before
        a valid partition can be created.  So this mode has more
        administrative overhead but offers greater flexibility.
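
To make the manual-mode rule concrete, here is a rough Python model of the
"valid reserve CPUs" computation described in the draft. It is a sketch
under stated assumptions, not the actual implementation: the Cgroup class
and both function names are hypothetical, CPU sets are plain Python sets,
and the requirement that "cpuset.cpus" be non-empty is an assumption not
stated in the draft.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cgroup:
    """Hypothetical stand-in for a cpuset-enabled cgroup."""
    name: str
    cpus: set = field(default_factory=set)      # models cpuset.cpus
    reserve: set = field(default_factory=set)   # models cpuset.cpus.reserve
    parent: Optional["Cgroup"] = None


def valid_reserve_cpus(parent: Cgroup, allocated: set) -> set:
    """CPUs reserved in `parent` and in every ancestor up to the root,
    minus those already allocated to another valid partition."""
    valid = set(parent.reserve)
    node = parent.parent
    while node is not None:
        valid &= node.reserve
        node = node.parent
    return valid - allocated


def can_become_partition_root(target: Cgroup, allocated: set) -> bool:
    """True if target.cpus is a subset of the parent's valid reserve CPUs."""
    if target.parent is None:
        return False
    # Assumption: an empty cpuset.cpus never qualifies.
    return bool(target.cpus) and target.cpus <= valid_reserve_cpus(
        target.parent, allocated
    )


# Example hierarchy: root -> mid -> leaf, where mid is not a partition root.
root = Cgroup("root", cpus=set(range(8)), reserve={4, 5, 6, 7})
mid = Cgroup("mid", cpus={4, 5, 6, 7}, reserve={4, 5, 6}, parent=root)
leaf = Cgroup("leaf", cpus={4, 5}, parent=mid)
```

With this hierarchy, can_become_partition_root(leaf, set()) is true because
CPUs 4 and 5 are reserved in both mid and root. It becomes false if CPU 5
has already been allocated to another partition, or if leaf asks for CPU 7,
which mid does not reserve.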

Cheers,
Longman