Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)

From: Frederic Weisbecker
Date: Wed Mar 27 2024 - 10:15:17 EST


On Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider wrote:
> > Then with that patch I ran TREE07, just some short iterations:
> >
> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
> >
> > And the warning triggers very quickly, at least since v6.3 and maybe
> > earlier. Is this expected behaviour, or am I right to assume that
> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
> >
>
> I would very much assume an offline CPU shouldn't show up in a
> sched_domain_span().
>
> Now, on top of the above, there's one more thing worth noting:
> cpu_up_down_serialize_trainwrecks()
>
> This just flushes the cpuset work, so after that the sched_domain topology
> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> around this vs. lock ordering aren't very reassuring however, so I need to
> look into this more.

Ouch...

>
> Maybe as a "quick" test to see if this is the right culprit, you could try
> that with CONFIG_CPUSETS=n? Because in that case the sched_domain update is
> run within sched_cpu_deactivate().

I just tried it and I'm afraid that doesn't help: it still triggers even
without cpusets :-s

Thanks.