Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

From: Waiman Long
Date: Fri Mar 29 2024 - 13:06:50 EST


On 3/28/24 22:08, Tejun Heo wrote:
(cc'ing Waiman and quoting whole body)

Hello,

On Thu, Mar 28, 2024 at 09:39:56PM +0100, Valentin Schneider wrote:
On 27/03/24 21:42, Thomas Gleixner wrote:
On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
On 22/03/24 14:22, Frederic Weisbecker wrote:
On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
Now, on top of the above, there's one more thing worth noting:
cpu_up_down_serialize_trainwrecks()

This just flushes the cpuset work, so after that the sched_domain topology
should be sane. However I see it's invoked at the tail end of _cpu_down(),
IOW /after/ takedown_cpu() has run, which sounds too late. The comments
around this vs. lock ordering aren't very reassuring however, so I need to
look into this more.
commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
the change log.

TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
cpu_hotplug_lock in the hotplug path for whatever silly reason. So
trying to flush the work from within the cpu hotplug lock write held
region is guaranteed to dead lock.

That's why all of this got deferred to a work queue. The flush at the
end of the hotplug code after dropping the hotplug lock is there to
prevent that user space sees the CPU hotplug uevent before the work is
done. But of course between bringing the CPU offline and the flush the
kernel internal state is inconsistent.

Thanks for the summary!

There was an attempt to make the CPU hotplug path synchronous in commit
a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
reverted with commit 2b729fe7f3 for the very wrong reason:

https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@xxxxxx/T/#u

(cpu_hotplug_lock){++++}-{0:0}:
lock_acquire+0xe4/0x25c
cpus_read_lock+0x50/0x154
static_key_slow_inc+0x18/0x30
mem_cgroup_css_alloc+0x824/0x8b0
cgroup_apply_control_enable+0x1d8/0x56c
cgroup_apply_control+0x40/0x344
cgroup_subtree_control_write+0x664/0x69c
cgroup_file_write+0x130/0x2e8
kernfs_fop_write+0x228/0x32c
__vfs_write+0x84/0x1d8
vfs_write+0x13c/0x1b4
ksys_write+0xb0/0x120

Instead of the revert this should have been fixed by doing:

+ cpus_read_lock();
mutex_lock();
mem_cgroup_css_alloc();
- static_key_slow_inc();
+ static_key_slow_inc_cpuslocked();

So looking at the state of things now, writing to the
cgroup.subtree_control file looks like:

cgroup_file_write()
`\
cgroup_subtree_control_write()
`\
cgroup_kn_lock_live()
`\
| cgroup_lock()
| `\
| mutex_lock(&cgroup_mutex);
|
cgroup_apply_control()
`\
cgroup_apply_control_enable()
`\
css_create()
`\
ss->css_alloc() [mem_cgroup_css_alloc()]
`\
static_branch_inc()

and same with cgroup_mkdir(). So if we want to fix the ordering that caused
the revert, we'd probably want to go for:

static inline void cgroup_lock(void)
{
+ cpus_read_lock();
mutex_lock(&cgroup_mutex);
}

static inline void cgroup_unlock(void)
{
mutex_unlock(&cgroup_mutex);
- cpus_read_unlock();
}

+ a handful of +_cpuslocked() changes.

As for cpuset, it looks like it's following this lock order:

cpus_read_lock();
mutex_lock(&cpuset_mutex);

Which AFAICT is what we want.

Sorry that I did not notice this back then because I was too focussed on
fixing that uevent nonsense in a halfways sane way.

After that revert cpuset locking became completely insane.

cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
advanced part of that insanity. Seriously this commit is just tasteless
and disgusting demonstration of how to paper over a problem which is not
fully understood.

After staring at it some more (including the history which led up to
these insanities) I really think that the CPU hotplug path can be made
truly synchronous and the memory hotplug path should just get the lock
ordering correct.

Can we please fix this for real and get all of this insanity out of the
way
Yes please!
Yeah, making that operation synchronous would be nice. Waiman, you're a lot
more familiar with this part than I am. Can you please take a look and see
whether we can turn the sched domain updates synchronous?

Sure. I will take a look on how to make cpu hotplug sched domain update synchronous.

Cheers,
Longman