Re: [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB

From: Joel Fernandes
Date: Fri Oct 06 2023 - 12:32:23 EST


On Fri, Oct 06, 2023 at 12:51:41PM +0200, Ingo Molnar wrote:
>
> * Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> > From: Vineeth Pillai <vineethrp@xxxxxxxxxx>
> >
> > Whenever a CPU stops its tick, it needs another idle CPU to handle the
> > load balancing on its behalf, since it can no longer run its own
> > periodic load balancing. This means it may need to update
> > 'nohz.next_balance' to 'rq->next_balance' if the upcoming nohz-idle
> > load balance is too far in the future. The update is done by kicking
> > an ILB, because the common ILB handler (_nohz_idle_balance), which
> > performs the regular nohz balancing, also refreshes
> > 'nohz.next_balance' by looking at 'rq->next_balance' of all the other
> > idle CPUs and picking the earliest value.
> >

[...snip...]

> > kernel/sched/fair.c | 21 ++++++++++++++-------
> > 1 file changed, 14 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index cb225921bbca..2ece55f32782 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu)
> >          /*
> >           * Ensures that if nohz_idle_balance() fails to observe our
> >           * @idle_cpus_mask store, it must observe the @has_blocked
> > -         * and @needs_update stores.
> > +         * stores.
> >           */
> >          smp_mb__after_atomic();
> >
> >          set_cpu_sd_state_idle(cpu);
> >
> > -        WRITE_ONCE(nohz.needs_update, 1);
> >  out:
> >          /*
> >           * Each time a cpu enter idle, we assume that it has blocked load and
> > @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> >  }
> >
> >  /*
> > - * Check if we need to run the ILB for updating blocked load before entering
> > - * idle state.
> > + * Check if we need to run the ILB for updating blocked load and/or updating
> > + * nohz.next_balance before entering idle state.
> >   */
> >  void nohz_run_idle_balance(int cpu)
> >  {
> >          unsigned int flags;
> >
> > -        flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));
> > +        flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu));
> > +
> > +        if (!flags)
> > +                return;
> >
> >          /*
> >           * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
> >           * (ie NOHZ_STATS_KICK set) and will do the same.
> >           */
> > -        if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
> > -                _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK);
> > +        if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
> > +            !need_resched())
> > +                _nohz_idle_balance(cpu_rq(cpu), flags);
> >  }
> >
> >  static void nohz_newidle_balance(struct rq *this_rq)
> > @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq)
> >          if (this_rq->avg_idle < sysctl_sched_migration_cost)
> >                  return;
> >
> > +        /* If rq->next_balance before nohz.next_balance, trigger ILB */
> > +        if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance)))
> > +                atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu));
> > +
> >          /* Don't need to update blocked load of idle CPUs*/
> >          if (!READ_ONCE(nohz.has_blocked) ||
> >              time_before(jiffies, READ_ONCE(nohz.next_blocked)))
>
> Ok, judging by your IPI reduction numbers this is definitely an
> optimization we want to do.

Great, thanks.

> The patch does make _nohz_idle_balance() run more parallel, as previously
> it would be generally run by the first-idle CPU in nohz.idle_cpus_mask (at
> least for next_balance updates), but I think it's still SMP-safe, as all
> key data structure updates are already rq-locked AFAICS.

Yes, true. We are looking into the parallelism aspect further and will
report back on how it goes. Ideally, we'd like to guarantee that
nohz.next_balance converges to the earliest rq->next_balance even in the
presence of concurrent updaters.

In theory the parallelism should be no greater than with the current
code, but we'll look more into it.

> One thing I noticed is that we now use nohz.needs_update only in a single
> remaining case, when _nohz_idle_balance() "self-defers":
>
>         /*
>          * If this CPU gets work to do, stop the load balancing
>          * work being done for other CPUs. Next load
>          * balancing owner will pick it up.
>          */
>         if (need_resched()) {
>                 if (flags & NOHZ_STATS_KICK)
>                         has_blocked_load = true;
>                 if (flags & NOHZ_NEXT_KICK)
>                         WRITE_ONCE(nohz.needs_update, 1);
>                 goto abort;
>         }
>
> Getting a need-resched flag set on this CPU is a pretty dubious reason to
> skip an ILB run IMO, and we could do entirely without that complication,
> allowing us to remove the nohz.needs_update flag handling logic altogether?

Yes, I think you are right: we can let the ILB run proceed when we are
only doing the lighter-weight stats / next_balance update, and only skip
it for a full-blown idle balance:

        if (need_resched() && (flags & NOHZ_BALANCE_KICK))
                goto abort;

That way we can get rid of the needs_update flag as well, as you and
Vincent pointed out. We could add this as a separate patch in the
series. Thanks for suggesting this idea.
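Roughly, the self-defer block in _nohz_idle_balance() would then become
something like the following (untested sketch; we only abort when a full
balance pass was requested, and still note the blocked load so that it
gets retried later):

        /*
         * If this CPU gets work to do, only stop the load balancing
         * work being done for other CPUs when a full balance pass was
         * requested; the cheap stats / next_balance updates can still
         * run to completion.
         */
        if (need_resched() && (flags & NOHZ_BALANCE_KICK)) {
                if (flags & NOHZ_STATS_KICK)
                        has_blocked_load = true;
                goto abort;
        }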

> If we do that then the !need_resched() flag needs to go from
> nohz_run_idle_balance() too:
>
>         /*
>          * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
>          * (ie NOHZ_STATS_KICK set) and will do the same.
>          */
>         if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
>             !need_resched())
>                 _nohz_idle_balance(cpu_rq(cpu), flags);
>
> ... if I'm reading the code right that is.

Yes, that sounds right.
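With that change, nohz_run_idle_balance() would reduce to something like
the following (untested sketch, derived from the hunk quoted above with
the !need_resched() test dropped):

        void nohz_run_idle_balance(int cpu)
        {
                unsigned int flags;

                flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK,
                                            nohz_flags(cpu));
                if (!flags)
                        return;

                /*
                 * Update blocked load and/or nohz.next_balance only if
                 * no SCHED_SOFTIRQ is about to happen (ie
                 * NOHZ_STATS_KICK set) and will do the same.
                 */
                if (flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK)))
                        _nohz_idle_balance(cpu_rq(cpu), flags);
        }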

We will work more on this and post the next revision soon. Thank you!

- Joel & Vineeth