Re: [PATCH] cpuset: fix race between hotplug work and later CPU offline

From: Peter Zijlstra
Date: Tue Nov 10 2020 - 11:45:30 EST


On Thu, Oct 29, 2020 at 02:18:45PM -0400, Daniel Jordan wrote:
> One of our machines keeled over trying to rebuild the scheduler domains.
> Mainline produces the same splat:
>
> BUG: unable to handle page fault for address: 0000607f820054db
> CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
> Workqueue: events cpuset_hotplug_workfn
> RIP: build_sched_domains
> Call Trace:
> partition_sched_domains_locked
> rebuild_sched_domains_locked
> cpuset_hotplug_workfn
>
> It happens with cgroup2 and exclusive cpusets only. This reproducer
> triggers it on an 8-cpu vm and works most effectively with no
> preexisting child cgroups:
>
> cd $UNIFIED_ROOT
> mkdir cg1
> echo 4-7 > cg1/cpuset.cpus
> echo root > cg1/cpuset.cpus.partition
>
> # with smt/control reading 'on',
> echo off > /sys/devices/system/cpu/smt/control
>
> RIP maps to
>
> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
>
> from sd_init(). sd_id is calculated earlier in the same function:
>
> cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
> sd_id = cpumask_first(sched_domain_span(sd));
>
> tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
> and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
> value from per_cpu_ptr() above.
>
> The problem is a race between cpuset_hotplug_workfn() and a later
> offline of CPU N. cpuset_hotplug_workfn() updates the effective masks
> when N is still online, the offline clears N from cpu_sibling_map, and
> then the worker uses the stale effective masks that still have N to
> generate the scheduling domains, leading the worker to read
> N's empty cpu_sibling_map in sd_init().
>
> rebuild_sched_domains_locked() prevented the race during the cgroup2
> cpuset series up until the Fixes commit changed its check. Make the
> check more robust so that it can detect an offline CPU in any exclusive
> cpuset's effective mask, not just the top one.

*groan*, what a mess...

> Fixes: 0ccea8feb980 ("cpuset: Make generate_sched_domains() work with partition")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@xxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Li Zefan <lizefan@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Prateek Sood <prsood@xxxxxxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Waiman Long <longman@xxxxxxxxxx>
> Cc: cgroups@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Cc: stable@xxxxxxxxxxxxxxx
> ---
>
> I think the right thing to do long-term is make the hotplug work
> synchronous, fixing the lockdep splats of past attempts, and then take
> these checks out of rebuild_sched_domains_locked, but this fixes the
> immediate issue and is small enough for stable. Open to suggestions.
>
> Prateek, are you planning on picking up your patches again?

Yeah, that might help, but those deadlocks were nasty iirc :/

> kernel/cgroup/cpuset.c | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 57b5b5d0a5fd..ac3124010b2a 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -983,8 +983,10 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> */
> static void rebuild_sched_domains_locked(void)
> {
> + struct cgroup_subsys_state *pos_css;
> struct sched_domain_attr *attr;
> cpumask_var_t *doms;
> + struct cpuset *cs;
> int ndoms;
>
> lockdep_assert_cpus_held();
> @@ -999,9 +1001,21 @@ static void rebuild_sched_domains_locked(void)
> !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> return;

So you argued above that effective_cpus was stale, I suppose the above
one works because its an equality test instead of a subset? Does that
wants a comment?

> - if (top_cpuset.nr_subparts_cpus &&
> - !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
> - return;
> + if (top_cpuset.nr_subparts_cpus) {
> + rcu_read_lock();
> + cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> + if (!is_partition_root(cs)) {
> + pos_css = css_rightmost_descendant(pos_css);
> + continue;
> + }
> + if (!cpumask_subset(cs->effective_cpus,
> + cpu_active_mask)) {
> + rcu_read_unlock();
> + return;
> + }
> + }
> + rcu_read_unlock();
> + }
>
> /* Generate domain masks and attrs */
> ndoms = generate_sched_domains(&doms, &attr);