Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

From: Valentin Schneider
Date: Thu Mar 28 2024 - 16:41:00 EST


On 27/03/24 21:42, Thomas Gleixner wrote:
> On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
>> On 22/03/24 14:22, Frederic Weisbecker wrote:
>>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>> Now, on top of the above, there's one more thing worth noting:
>> cpu_up_down_serialize_trainwrecks()
>>
>> This just flushes the cpuset work, so after that the sched_domain topology
>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>> around this vs. lock ordering aren't very reassuring however, so I need to
>> look into this more.
>
> commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
> the change log.
>
> TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
> cpu_hotplug_lock in the hotplug path for whatever silly reason. So
> trying to flush the work from within the cpu hotplug lock write held
> region is guaranteed to dead lock.
>
> That's why all of this got deferred to a work queue. The flush at the
> end of the hotplug code after dropping the hotplug lock is there to
> prevent that user space sees the CPU hotplug uevent before the work is
> done. But of course between bringing the CPU offline and the flush the
> kernel internal state is inconsistent.
>

Thanks for the summary!

> There was an attempt to make the CPU hotplug path synchronous in commit
> a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
> reverted with commit 2b729fe7f3 for the very wrong reason:
>
> https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@xxxxxx/T/#u
>
> (cpu_hotplug_lock){++++}-{0:0}:
> lock_acquire+0xe4/0x25c
> cpus_read_lock+0x50/0x154
> static_key_slow_inc+0x18/0x30
> mem_cgroup_css_alloc+0x824/0x8b0
> cgroup_apply_control_enable+0x1d8/0x56c
> cgroup_apply_control+0x40/0x344
> cgroup_subtree_control_write+0x664/0x69c
> cgroup_file_write+0x130/0x2e8
> kernfs_fop_write+0x228/0x32c
> __vfs_write+0x84/0x1d8
> vfs_write+0x13c/0x1b4
> ksys_write+0xb0/0x120
>
> Instead of the revert this should have been fixed by doing:
>
> + cpus_read_lock();
> mutex_lock();
> mem_cgroup_css_alloc();
> - static_key_slow_inc();
> + static_key_slow_inc_cpuslocked();
>

So looking at the state of things now, writing to the
cgroup.subtree_control file looks like:

cgroup_file_write()
`\
cgroup_subtree_control_write()
`\
cgroup_kn_lock_live()
`\
| cgroup_lock()
| `\
| mutex_lock(&cgroup_mutex);
|
cgroup_apply_control()
`\
cgroup_apply_control_enable()
`\
css_create()
`\
ss->css_alloc() [mem_cgroup_css_alloc()]
`\
static_branch_inc()

and same with cgroup_mkdir(). So if we want to fix the ordering that caused
the revert, we'd probably want to go for:

static inline void cgroup_lock(void)
{
+ cpus_read_lock();
mutex_lock(&cgroup_mutex);
}

static inline void cgroup_unlock(void)
{
mutex_unlock(&cgroup_mutex);
- cpus_read_unlock();
}

+ a handful of +_cpuslocked() changes.

As for cpuset, it looks like it's following this lock order:

cpus_read_lock();
mutex_lock(&cpuset_mutex);

Which AFAICT is what we want.

> Sorry that I did not notice this back then because I was too focussed on
> fixing that uevent nonsense in a halfways sane way.
>
> After that revert cpuset locking became completely insane.
>
> cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
> advanced part of that insanity. Seriously this commit is just tasteless
> and disgusting demonstration of how to paper over a problem which is not
> fully understood.
>
> After staring at it some more (including the history which led up to
> these insanities) I really think that the CPU hotplug path can be made
> truly synchronous and the memory hotplug path should just get the lock
> ordering correct.
>
> Can we please fix this for real and get all of this insanity out of the
> way

Yes please!

> including the nonsensical comments in the cpuset code:
>
> * Call with cpuset_mutex held. Takes cpus_read_lock().
>
> ...
>
> lockdep_assert_cpus_held();
> lockdep_assert_held(&cpuset_mutex);
>
> Oh well...
>
> Thanks,
>
> tglx