Re: [PATCH 4/4] Revert "kernel/sched: Modify initial boot task idle setup"

From: Frederic Weisbecker
Date: Fri Oct 20 2023 - 08:31:22 EST


On Fri, Oct 20, 2023 at 10:25:31AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 20, 2023 at 01:35:43AM +0200, Frederic Weisbecker wrote:
> > Now that rcutiny can deal with early boot PF_IDLE setting, revert
> > commit cff9b2332ab762b7e0586c793c431a8f2ea4db04.
> >
> > This fixes several subtle issues introduced on RCU-tasks(-trace):
> >
> > 1) RCU-tasks stalls when:
> >
> > 1.1 Grace period is started before init/0 had a chance to set PF_IDLE,
> > keeping it stuck in the holdout list until idle ever schedules.
> >
> > 1.2 Grace period is started when some possible CPUs have never been
> > online, keeping their idle tasks stuck in the holdout list until
> > the CPU ever boots up.
> >
> > 1.3 Similar to 1.1 but with secondary CPUs: Grace period is started
> > concurrently with secondary CPU booting, putting its idle task in
> > the holdout list because PF_IDLE isn't yet observed on it. It
> > stays then stuck in the holdout list until that CPU ever
> > schedules. The effect is mitigated here by all the smpboot
> > kthreads and the hotplug AP thread that must run to bring the
> > CPU up.
> >
> > 2) Spurious warning on RCU task trace that assumes offline CPU's idle
> > task is always PF_IDLE.
> >
> > More issues have been found in RCU-tasks related to PF_IDLE which should
> > be fixed with later changes as those are not regressions:
> >
> > 3) The RCU-Tasks semantics consider the idle loop as a quiescent state,
> > however:
> >
> > 3.1 The boot code preceding the idle entry is included in this
> > quiescent state. Especially after the completion of kthreadd_done
> > after which init/1 can launch userspace concurrently. The window
> > is tiny before PF_IDLE is set but it exists.
> >
> > 3.2 Similarly, the boot code preceding the idle entry on secondary
> > CPUs is wrongly accounted as RCU tasks quiescent state.
> >
>
> Urgh... so the plan is to fix RCU-tasks for all of the above to not rely
> on PF_IDLE ? Because I rather like the more strict PF_IDLE and
> subsequently don't much like this revert.

Yeah I can't say I really like the old coverage of PF_IDLE either. The new one
(after Liam's patch) is only halfway better defined though: it makes the boot
CPU's idle behave quite well: PF_IDLE is set on idle entry. And secondary
CPU's idle behave quite well also except when they go offline and then online
again. And then the secondary boot code becomes PF_IDLE.

We probably need something like this:

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 3b9d5c7eb4a2..b24d7937b989 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1394,7 +1394,9 @@ void cpuhp_report_idle_dead(void)
{
struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);

+ current->flags &= ~PF_IDLE;
BUG_ON(st->state != CPUHP_AP_OFFLINE);
+
rcutree_report_cpu_dead();
st->state = CPUHP_AP_IDLE_DEAD;
/*
@@ -1642,6 +1644,8 @@ void cpuhp_online_idle(enum cpuhp_state state)
{
struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);

+ current->flags |= PF_IDLE;
+
/* Happens for the boot cpu */
if (state != CPUHP_AP_ONLINE_IDLE)
return;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 5007b25c5bc6..342f58a329f5 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,7 +373,6 @@ EXPORT_SYMBOL_GPL(play_idle_precise);

void cpu_startup_entry(enum cpuhp_state state)
{
- current->flags |= PF_IDLE;
arch_cpu_idle_prepare();
cpuhp_online_idle(state);
while (1)


And then RCU-tasks can better rely on it as it really draws correctly
the idle coverage. As for working on top of that, we have thought about
solutions, lemme try to make them proper.

Thanks.