Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

From: K Prateek Nayak
Date: Tue Jan 23 2024 - 05:01:19 EST


Hello Vincent,

On 1/23/2024 1:36 PM, Vincent Guittot wrote:
> On Tue, 23 Jan 2024 at 05:58, K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>>
>> Hello Tim,
>>
>> On 1/23/2024 3:29 AM, Tim Chen wrote:
>>> On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b803030c3a03..1fedc7e29c98 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
>>>> if (!rf)
>>>> return NULL;
>>>>
>>>> + /*
>>>> + * An idle CPU in TIF_POLLING mode might end up here after processing
>>>> + * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
>>>> + * sending an actual IPI. In such cases, where an idle CPU was woken
>>>> + * up only to process an interrupt, without necessarily queuing a task
>>>> + * on it, skip newidle_balance() to facilitate faster idle re-entry.
>>>> + */
>>>> + if (prev == rq->idle)
>>>> + return NULL;
>>>> +
>>>
>>> Should we check the call function queue directly to detect that there is
>>> an IPI waiting to be processed? something like
>>>
>>> if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
>>> return NULL;
>>
>> That could be a valid check too. However, if an IPI is queued right
>> after this check, the processing is still delayed since
>> newidle_balance() only bails out for scenarios when a wakeup is trying
>> to queue a new task on the CPU running the newidle_balance().
>>
>>>
>>> Could there be cases where we want to do idle balance in this code path?
>>> Say a cpu is idle and a scheduling tick came in, we may try
>>> to look for something to run on the idle cpu. Seems like after
>>> your change above, that would be skipped.
>>
>> Wouldn't scheduler_tick() do load balancing when the time comes? In my
>> testing, I did not see a case where the workloads I tested were
>> sensitive to the aspect of newidle_balance() being invoked at scheduler
>> tick. Have you come across a workload which might be sensitive to this
>> aspect that I can quickly test and verify? Meanwhile, I'll run the
>> workloads mentioned in the commit log on an Intel system to see if I
>> can spot any sensitivity to this change.
>
> Instead of trying to fix spurious need_resched in the scheduler,
> can't we find a way to prevent it from happening ?

The need_resched is not spurious. It is an effect of the optimization
introduced by commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()"): to pull a CPU in TIF_POLLING mode
out of idle (and this happens for C0 (POLL) and C1 (MWAIT) on the test
machine), instead of sending an IPI for smp_call_function_single(), the
sender sets the TIF_NEED_RESCHED flag in the idle task's thread info.
On the path to "schedule_idle()", the call to
"flush_smp_call_function_queue()" then processes the function call.

But since "TIF_NEED_RESCHED" was set to pull the CPU out of idle, the
scheduler now believes a new task is waiting, which leads to the
following call stack:

do_idle()
  schedule_idle()
    __schedule(SM_NONE)
      /* local_irq_disable() */
      pick_next_task()
        __pick_next_task()
          pick_next_task_fair()
            newidle_balance()
              ... /* Still running with IRQs disabled */

Since IRQs are disabled, the processing of IPIs is delayed, leading to
an issue similar to the one outlined in commit 792b9f65a568 ("sched:
Allow newidle balancing to bail out of load_balance") when benchmarking
ipistorm.

>
> Because of TIF_NEED_RESCHED being set when TIF_POLLING is set, idle
> load balances are already skipped for a less aggressive newly idle
> load balanced:
> https://lore.kernel.org/all/CAKfTPtC9Px_W84YRJqnFNkL8oofO15D-P=VTCMUUu7NJr+xwBA@xxxxxxxxxxxxxx/

Are you referring to the "need_resched()" condition check in
"nohz_csd_func()"? Please correct me if I'm wrong.

When I ran sched-scoreboard
(https://github.com/AMDESE/sched-scoreboard/) with the patch on an idle
system for 60s, I still see the idle "load_balance count" go up in the
schedstat data.

Following is the idle balance data for the SMT domain on each kernel:

o tip:sched/core

< ---------------------------------------- Category: idle ----------- >
load_balance count on cpu idle : 2678
load_balance found balanced on cpu idle : 2678
->load_balance failed to find busier queue on cpu idle : 0
->load_balance failed to find busier group on cpu idle : 2678
load_balance move task failed on cpu idle : 0
*load_balance success count on cpu idle : 0
imbalance sum on cpu idle : 0
pull_task count on cpu idle : 0
*avg task pulled per successfull lb attempt (cpu idle) : 0
->pull_task when target task was cache-hot on cpu idle : 0
-------------------------------------------------------------------------

o tip:sched/core + patch

< ---------------------------------------- Category: idle ----------- >
load_balance count on cpu idle : 1895
load_balance found balanced on cpu idle : 1895
->load_balance failed to find busier queue on cpu idle : 0
->load_balance failed to find busier group on cpu idle : 1895
load_balance move task failed on cpu idle : 0
*load_balance success count on cpu idle : 0
imbalance sum on cpu idle : 0
pull_task count on cpu idle : 0
*avg task pulled per successfull lb attempt (cpu idle) : 0
->pull_task when target task was cache-hot on cpu idle : 0
-------------------------------------------------------------------------

Am I missing something? "load_balance count" is only updated when
"load_balance()" is actually called.

>
> The root of the problem is that we keep TIF_NEED_RESCHED set

We had prototyped a TIF_NEED_IPI flag to skip calls to schedule_idle()
on CPUs in TIF_POLLING mode when the idle CPU only has to process an
IPI. Although it solves the problem described in the commit log, it
also requires enabling and testing on multiple architectures:

$ grep -r "_TIF_POLLING" arch/ | cut -d '/' -f2 | uniq
csky
x86
powerpc
parisc
openrisc
sparc
nios2
microblaze
sh
alpha

This optimization in the scheduler was the simpler of the two
approaches to achieve the same result in the case of ipistorm.

>
>>
>> [..snip..]

--
Thanks and Regards,
Prateek