Re: [PATCH v2 2/2] cgroup/cpuset: Don't update tasks' cpumasks for cpu offline events

From: Waiman Long
Date: Sat Feb 04 2023 - 23:42:35 EST



On 2/4/23 04:40, Peter Zijlstra wrote:
On Thu, Feb 02, 2023 at 09:32:00AM -0500, Waiman Long wrote:
It is a known issue that when a task is in a non-root v1 cpuset, a cpu
offline event will cause that cpu to be lost from the task's cpumask
permanently as the cpuset's cpus_allowed mask won't get back that cpu
when it becomes online again. A possible workaround for this type of
cpu offline/online sequence is to leave the offline cpu in the task's
cpumask and do the update only if new cpus are added. It also has the
benefit of reducing the overhead of a cpu offline event.
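
A minimal user-space sketch of the problem and the workaround described above, using plain bitmasks rather than the kernel's cpumask API. All names here (toy_cpuset, toy_task, toy_cpu_offline() and so on) are hypothetical and are not taken from the patch; the update_task flag only models whether the task's mask is rewritten on an offline event.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_mask_t;            /* one bit per cpu, toy stand-in for a cpumask */

struct toy_cpuset { toy_mask_t cpus_allowed; };  /* models v1 "cpuset.cpus" */
struct toy_task   { toy_mask_t cpus_mask; };     /* models the task's cpumask */

static toy_mask_t online_mask;          /* models cpu_online_mask */

/*
 * Offline event for a non-root v1 cpuset: the cpu is dropped from
 * cpus_allowed. With update_task set, the task's mask is rewritten too
 * (old behaviour); without it, the task's mask is left alone (workaround).
 */
static void toy_cpu_offline(struct toy_cpuset *cs, struct toy_task *t,
                            int cpu, bool update_task)
{
    online_mask &= ~(1ULL << cpu);
    cs->cpus_allowed &= ~(1ULL << cpu);
    if (update_task)
        t->cpus_mask = cs->cpus_allowed;
}

/* Online event: v1 does not put the cpu back into cpus_allowed. */
static void toy_cpu_online(int cpu)
{
    online_mask |= 1ULL << cpu;
}

/* The scheduler only ever uses cpus that are both allowed and online. */
static toy_mask_t toy_runnable_cpus(const struct toy_task *t)
{
    return t->cpus_mask & online_mask;
}
```

In this model, a task that starts on cpus 1-2 ends up restricted to cpu 1 after an offline/online cycle of cpu 2 when update_task is set, and gets cpu 2 back when it is not.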

Note that the scheduler is able to ignore the offline cpus and so
leaving offline cpus in the cpumask won't do any harm.

Now with v2, only the cpu online events will cause a call to
hotplug_update_tasks() to update the tasks' cpumasks. For tasks
in a non-root v1 cpuset, the situation is a bit different. The cpu
offline event will not cause a change to a task's cpumask. Neither does a
subsequent cpu online event because "cpuset.cpus" had that offline cpu
removed and its subsequent onlining won't be registered as a change
to the cpuset. An exception is when all the cpus in the original
"cpuset.cpus" have gone offline once. In that case, "cpuset.cpus" will
become empty which will force task migration to its parent. A task's
cpumask will also be changed if set_cpus_allowed_ptr() is somehow called
for whatever reason.

Of course, this patch can cause a discrepancy between v1's "cpuset.cpus"
and its tasks' cpumasks. However, it can also largely work around
the offline cpu losing problem with v1 cpuset.
I don't think you can simply not update on offline, even if
effective_cpus doesn't go empty, because the intersection between
task_cpu_possible_mask() and effective_cpus might have gone empty.

Very fundamentally, the introduction of task_cpu_possible_mask() means
that you now *HAVE* to always consider affinity settings per-task, you
cannot group them anymore.
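
To make the objection concrete, here is a small user-space sketch (bitmasks only, not kernel code) of a per-task check. The struct and function names are hypothetical; the possible field merely stands in for what task_cpu_possible_mask() would return for that task.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_mask_t;            /* one bit per cpu */

struct toy_hetero_task {
    toy_mask_t possible;                /* stand-in for task_cpu_possible_mask() */
};

/*
 * Count the tasks left without any usable cpu, i.e. those whose possible
 * mask no longer intersects the cpuset's effective cpus. The cpuset-wide
 * mask can be non-empty while this count is non-zero, which is why the
 * check has to be made per task.
 */
static size_t toy_count_stranded(const struct toy_hetero_task *tasks,
                                 size_t ntasks, toy_mask_t effective)
{
    size_t stranded = 0;

    for (size_t i = 0; i < ntasks; i++) {
        if ((tasks[i].possible & effective) == 0)
            stranded++;
    }
    return stranded;
}
```

For example, with effective cpus 1-3 and a task limited to cpus 0-1, taking cpu 1 offline leaves effective cpus 2-3 non-empty, yet that task has nowhere to run.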

Right, it makes sense to me. That is why I am thinking that we should have an API like may_have_task_cpu_possible_mask() that returns true for heterogeneous systems. That will allow us to apply some optimizations in systems with homogeneous cpus. So far, this is an arm64 only feature. We shouldn't penalize other arches because arm64 needs that. In the future, maybe more arches will have that.
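
A rough user-space sketch of how such a helper could gate the per-task work. This is only an illustration under assumptions: may_have_task_cpu_possible_mask() does not exist in the kernel, and the toy_ names, the heterogeneous flag and the scan function below are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_mask_t;            /* one bit per cpu */

struct toy_system {
    bool heterogeneous;                 /* assumed to be known at boot time */
};

/* Hypothetical helper: true only if per-task possible masks can differ. */
static bool toy_may_have_task_cpu_possible_mask(const struct toy_system *sys)
{
    return sys->heterogeneous;
}

/*
 * Handle an offline event. Returns how many tasks were examined and stores
 * in *stranded how many of them have no usable cpu left. On a homogeneous
 * system the per-task walk is skipped entirely.
 */
static size_t toy_offline_scan(const struct toy_system *sys,
                               const toy_mask_t *task_possible, size_t ntasks,
                               toy_mask_t effective, size_t *stranded)
{
    *stranded = 0;
    if (!toy_may_have_task_cpu_possible_mask(sys))
        return 0;                       /* group-level decision is enough */

    for (size_t i = 0; i < ntasks; i++) {
        if ((task_possible[i] & effective) == 0)
            (*stranded)++;
    }
    return ntasks;
}
```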

Cheers,
Longman