Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

From: Tejun Heo
Date: Thu Mar 28 2024 - 22:10:37 EST


(cc'ing Waiman and quoting whole body)

Hello,

On Thu, Mar 28, 2024 at 09:39:56PM +0100, Valentin Schneider wrote:
> On 27/03/24 21:42, Thomas Gleixner wrote:
> > On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
> >> On 22/03/24 14:22, Frederic Weisbecker wrote:
> >>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
> >> Now, on top of the above, there's one more thing worth noting:
> >> cpu_up_down_serialize_trainwrecks()
> >>
> >> This just flushes the cpuset work, so after that the sched_domain topology
> >> should be sane. However, I see it's invoked at the tail end of _cpu_down(),
> >> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> >> around this vs. lock ordering aren't very reassuring, however, so I need to
> >> look into this more.
> >
> > commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
> > the change log.
> >
> > TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
> > cpu_hotplug_lock in the hotplug path for whatever silly reason. So
> > trying to flush the work from within the cpu_hotplug_lock write-held
> > region is guaranteed to deadlock.
> >
> > That's why all of this got deferred to a work queue. The flush at the
> > end of the hotplug code, after dropping the hotplug lock, is there to
> > prevent user space from seeing the CPU hotplug uevent before the work is
> > done. But of course between bringing the CPU offline and the flush the
> > kernel-internal state is inconsistent.
> >
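A minimal, untested sketch of the pattern described above (the identifiers are
made up; this is not the actual cpuset/hotplug code): the deferred work needs
cpu_hotplug_lock for reading, so flushing it while the hotplug core still
write-holds that lock can never complete.

#include <linux/cpu.h>
#include <linux/workqueue.h>

static void hotplug_fixup_fn(struct work_struct *work)
{
	cpus_read_lock();	/* blocks behind the hotplug writer below */
	/* ... rebuild sched domains, update cpusets ... */
	cpus_read_unlock();
}
static DECLARE_WORK(hotplug_fixup_work, hotplug_fixup_fn);

static void broken_cpu_down(void)
{
	cpus_write_lock();			/* hotplug write-held region */
	schedule_work(&hotplug_fixup_work);
	flush_work(&hotplug_fixup_work);	/* deadlock: the worker needs the lock we hold */
	cpus_write_unlock();
}

That is why the flush had to move past cpus_write_unlock(), at the cost of the
inconsistent window described above.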
>
> Thanks for the summary!
>
> > There was an attempt to make the CPU hotplug path synchronous in commit
> > a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous"), which got
> > reverted with commit 2b729fe7f3 for the very wrong reason:
> >
> > https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@xxxxxx/T/#u
> >
> > (cpu_hotplug_lock){++++}-{0:0}:
> >        lock_acquire+0xe4/0x25c
> >        cpus_read_lock+0x50/0x154
> >        static_key_slow_inc+0x18/0x30
> >        mem_cgroup_css_alloc+0x824/0x8b0
> >        cgroup_apply_control_enable+0x1d8/0x56c
> >        cgroup_apply_control+0x40/0x344
> >        cgroup_subtree_control_write+0x664/0x69c
> >        cgroup_file_write+0x130/0x2e8
> >        kernfs_fop_write+0x228/0x32c
> >        __vfs_write+0x84/0x1d8
> >        vfs_write+0x13c/0x1b4
> >        ksys_write+0xb0/0x120
> >
> > Instead of the revert this should have been fixed by doing:
> >
> > +	cpus_read_lock();
> > 	mutex_lock();
> > 	mem_cgroup_css_alloc();
> > -		static_key_slow_inc();
> > +		static_key_slow_inc_cpuslocked();
> >
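For context, the _cpuslocked variant only shifts responsibility for
cpu_hotplug_lock to the caller; static_branch_inc() boils down to
static_key_slow_inc(), which is what shows up in the splat above. A rough
sketch with a made-up key (not the real memcg one):

#include <linux/cpu.h>
#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(example_key);	/* made-up key, for illustration */

/* Plain variant: takes cpu_hotplug_lock for reading internally. */
static void example_enable(void)
{
	static_branch_inc(&example_key);
}

/* _cpuslocked variant: the caller must already hold cpu_hotplug_lock. */
static void example_enable_cpuslocked(void)
{
	lockdep_assert_cpus_held();
	static_branch_inc_cpuslocked(&example_key);
}

With the ordering suggested above, mem_cgroup_css_alloc() would run entirely
inside the cpus_read_lock() section, so the nested read-lock attempt that
lockdep complained about goes away.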
>
> So looking at the state of things now, writing to the
> cgroup.subtree_control file looks like:
>
> cgroup_file_write()
> `\
>   cgroup_subtree_control_write()
>   `\
>     cgroup_kn_lock_live()
>     `\
>     | cgroup_lock()
>     | `\
>     |   mutex_lock(&cgroup_mutex);
>     |
>     cgroup_apply_control()
>     `\
>       cgroup_apply_control_enable()
>       `\
>         css_create()
>         `\
>           ss->css_alloc() [mem_cgroup_css_alloc()]
>           `\
>             static_branch_inc()
>
> and the same goes for cgroup_mkdir(). So if we want to fix the ordering that
> caused the revert, we'd probably want to go for:
>
> static inline void cgroup_lock(void)
> {
> +	cpus_read_lock();
> 	mutex_lock(&cgroup_mutex);
> }
>
> static inline void cgroup_unlock(void)
> {
> 	mutex_unlock(&cgroup_mutex);
> +	cpus_read_unlock();
> }
>
> + a handful of +_cpuslocked() changes.
>
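Roughly, the composed result would then look like the sketch below (untested;
the example_* names are stand-ins, not the real cgroup core), giving the same
kind of ordering, cpu_hotplug_lock taken first, on the subtree_control write
path as the cpuset path quoted next:

#include <linux/cpu.h>
#include <linux/mutex.h>
#include <linux/jump_label.h>

static DEFINE_MUTEX(example_cgroup_mutex);	/* stands in for cgroup_mutex */
static DEFINE_STATIC_KEY_FALSE(example_key);	/* stands in for the memcg key */

static void example_cgroup_lock(void)
{
	cpus_read_lock();			/* cpu_hotplug_lock first ... */
	mutex_lock(&example_cgroup_mutex);	/* ... then cgroup_mutex */
}

static void example_cgroup_unlock(void)
{
	mutex_unlock(&example_cgroup_mutex);
	cpus_read_unlock();
}

static void example_subtree_control_write(void)
{
	example_cgroup_lock();
	/* a ->css_alloc() callback can now use the _cpuslocked variant */
	static_branch_inc_cpuslocked(&example_key);
	example_cgroup_unlock();
}
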
> As for cpuset, it looks like it's following this lock order:
>
> cpus_read_lock();
> mutex_lock(&cpuset_mutex);
>
> Which AFAICT is what we want.
>
> > Sorry that I did not notice this back then because I was too focussed on
> > fixing that uevent nonsense in a halfway sane way.
> >
> > After that revert cpuset locking became completely insane.
> >
> > cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
> > advanced part of that insanity. Seriously, this commit is just a tasteless
> > and disgusting demonstration of how to paper over a problem which is not
> > fully understood.
> >
> > After staring at it some more (including the history which led up to
> > these insanities) I really think that the CPU hotplug path can be made
> > truly synchronous and the memory hotplug path should just get the lock
> > ordering correct.
> >
> > Can we please fix this for real and get all of this insanity out of the
> > way?
>
> Yes please!

Yeah, making that operation synchronous would be nice. Waiman, you're a lot
more familiar with this part than I am. Can you please take a look and see
whether we can make the sched domain updates synchronous?

Thanks.

--
tejun