Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

From: Vincent Guittot
Date: Tue Jan 23 2024 - 08:39:29 EST


On Tue, 23 Jan 2024 at 11:01, K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>
> Hello Vincent,
>
> On 1/23/2024 1:36 PM, Vincent Guittot wrote:
> > On Tue, 23 Jan 2024 at 05:58, K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
> >>
> >> Hello Tim,
> >>
> >> On 1/23/2024 3:29 AM, Tim Chen wrote:
> >>> On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index b803030c3a03..1fedc7e29c98 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
> >>>> 	if (!rf)
> >>>> 		return NULL;
> >>>>
> >>>> +	/*
> >>>> +	 * An idle CPU in TIF_POLLING mode might end up here after processing
> >>>> +	 * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
> >>>> +	 * sending an actual IPI. In such cases, where an idle CPU was woken
> >>>> +	 * up only to process an interrupt, without necessarily queuing a task
> >>>> +	 * on it, skip newidle_balance() to facilitate faster idle re-entry.
> >>>> +	 */
> >>>> +	if (prev == rq->idle)
> >>>> +		return NULL;
> >>>> +
> >>>
> >>> Should we check the call function queue directly to detect that there is
> >>> an IPI waiting to be processed? something like
> >>>
> >>> if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
> >>> 	return NULL;
> >>
> >> That could be a valid check too. However, if an IPI is queued right
> >> after this check, the processing is still delayed since
> >> newidle_balance() only bails out for scenarios when a wakeup is trying
> >> to queue a new task on the CPU running the newidle_balance().
> >>
> >>>
> >>> Could there be cases where we want to do idle balance in this code path?
> >>> Say a cpu is idle and a scheduling tick came in, we may try
> >>> to look for something to run on the idle cpu. Seems like after
> >>> your change above, that would be skipped.
> >>
> >> Wouldn't scheduler_tick() do load balancing when the time comes? In my
> >> testing, I did not see a case where the workloads I tested were
> >> sensitive to the aspect of newidle_balance() being invoked at scheduler
> >> tick. Have you come across a workload which might be sensitive to this
> >> aspect that I can quickly test and verify? Meanwhile, I'll run the
> >> workloads mentioned in the commit log on an Intel system to see if I
> >> can spot any sensitivity to this change.
> >
> > Instead of trying to fix spurious need_resched in the scheduler,
> > can't we find a way to prevent it from happening?
>
> The need_resched is not spurious. It is an effect of the optimization
> introduced by commit b2a02fc43a1f ("smp: Optimize
> send_call_function_single_ipi()") where, to pull a CPU in TIF_POLLING
> mode out of idle (and this happens for C0 (POLL) and C1 (MWAIT) on the
> test machine), instead of sending an IPI for smp_call_function_single(),
> the sender sets the TIF_NEED_RESCHED flag in the idle task's thread
> info, and on the path to schedule_idle(), the call to
> flush_smp_call_function_queue() processes the function call.

I mean it's spurious in the sense that we don't need to resched, but we
need to pull the CPU out of the polling loop. At that time, we don't
know if there is a need to resched.

>
> But since "TIF_NEED_RESCHED" was set to pull the CPU out of idle, the
> scheduler now believes a new task exists which leads to the following
> call stack:

Exactly. TIF_NEED_RESCHED has been set, so the scheduler now believes it
needs to look for a task. The solution is to not set TIF_NEED_RESCHED
if you don't want the scheduler to look for a task, including pulling
one from another CPU.

>
> do_idle()
>   schedule_idle()
>     __schedule(SM_NONE)
>       /* local_irq_disable() */
>       pick_next_task()
>         __pick_next_task()
>           pick_next_task_fair()
>             newidle_balance()
>             ... /* Still running with IRQs disabled */
>
> Since IRQs are disabled, the processing of IPIs is delayed, leading to
> an issue similar to the one outlined in commit 792b9f65a568 ("sched:
> Allow newidle balancing to bail out of load_balance") when benchmarking
> ipistorm.

IMO it's not the same, because commit 792b9f65a568 wants to abort early
if something new has happened.

>
> >
> > Because of TIF_NEED_RESCHED being set when TIF_POLLING is set, idle
> > load balances are already skipped in favor of a less aggressive newly
> > idle load balance:
> > https://lore.kernel.org/all/CAKfTPtC9Px_W84YRJqnFNkL8oofO15D-P=VTCMUUu7NJr+xwBA@xxxxxxxxxxxxxx/
>
> Are you referring to the "need_resched()" condition check in
> "nohz_csd_func()"? Please correct me if I'm wrong.

yes

>
> When I ran sched-scoreboard
> (https://github.com/AMDESE/sched-scoreboard/) with the patch on an
> idle system for 60s, I saw the idle "load_balance count" go up in the
> schedstats.

If TIF_POLLING is not set, you will use a normal IPI; but otherwise, the
wakeup for an idle load balance is skipped because need_resched is set,
and we have a newly idle load balance, which you now want to skip too.

>
> Following are the data for idle balance on SMT domain for each kernel:
>
> o tip:sched/core
>
> < ---------------------------------------- Category: idle ----------- >
> load_balance count on cpu idle : 2678
> load_balance found balanced on cpu idle : 2678
> ->load_balance failed to find busier queue on cpu idle : 0
> ->load_balance failed to find busier group on cpu idle : 2678
> load_balance move task failed on cpu idle : 0
> *load_balance success count on cpu idle : 0
> imbalance sum on cpu idle : 0
> pull_task count on cpu idle : 0
> *avg task pulled per successfull lb attempt (cpu idle) : 0
> ->pull_task when target task was cache-hot on cpu idle : 0
> -------------------------------------------------------------------------
>
> o tip:sched/core + patch
>
> < ---------------------------------------- Category: idle ----------- >
> load_balance count on cpu idle : 1895
> load_balance found balanced on cpu idle : 1895
> ->load_balance failed to find busier queue on cpu idle : 0
> ->load_balance failed to find busier group on cpu idle : 1895
> load_balance move task failed on cpu idle : 0
> *load_balance success count on cpu idle : 0
> imbalance sum on cpu idle : 0
> pull_task count on cpu idle : 0
> *avg task pulled per successfull lb attempt (cpu idle) : 0
> ->pull_task when target task was cache-hot on cpu idle : 0
> -------------------------------------------------------------------------
>
> Am I missing something? "load_balance count" is only updated when
> "load_balance()" is called.
>
> >
> > The root of the problem is that we keep TIF_NEED_RESCHED set
>
> We had prototyped a TIF_NEED_IPI flag to skip calls to schedule_idle()
> on CPUs in TIF_POLLING mode when the idle CPU only has to process an
> IPI. Although it solves the problem described in the commit log, it
> also required enabling and testing on multiple architectures.

Yes, but that's the right solution IMO, and it will prevent us from
then trying to catch the needless TIF_NEED_RESCHED.

>
> $ grep -r "_TIF_POLLING" arch/ | cut -d '/' -f2 | uniq
> csky
> x86
> powerpc
> parisc
> openrisc
> sparc
> nios2
> microblaze
> sh
> alpha
>
> This optimization in the scheduler was the simpler of the two
> approaches to achieve the same result in the case of ipistorm.
>
> >
> >>
> >> [..snip..]
>
> --
> Thanks and Regards,
> Prateek