Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

From: K Prateek Nayak
Date: Wed Jan 24 2024 - 03:26:52 EST


Hello Vincent,

On 1/23/2024 7:09 PM, Vincent Guittot wrote:
> On Tue, 23 Jan 2024 at 11:01, K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>>
>> Hello Vincent,
>>
>> On 1/23/2024 1:36 PM, Vincent Guittot wrote:
>>> On Tue, 23 Jan 2024 at 05:58, K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>>>>
>>>> Hello Tim,
>>>>
>>>> On 1/23/2024 3:29 AM, Tim Chen wrote:
>>>>> On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index b803030c3a03..1fedc7e29c98 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
>>>>>> if (!rf)
>>>>>> return NULL;
>>>>>>
>>>>>> + /*
>>>>>> + * An idle CPU in TIF_POLLING mode might end up here after processing
>>>>>> + * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
>>>>>> + * sending an actual IPI. In such cases, where an idle CPU was woken
>>>>>> + * up only to process an interrupt, without necessarily queuing a task
>>>>>> + * on it, skip newidle_balance() to facilitate faster idle re-entry.
>>>>>> + */
>>>>>> + if (prev == rq->idle)
>>>>>> + return NULL;
>>>>>> +
>>>>>
>>>>> Should we check the call function queue directly to detect that there is
>>>>> an IPI waiting to be processed? something like
>>>>>
>>>>> if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
>>>>> return NULL;
>>>>
>>>> That could be a valid check too. However, if an IPI is queued right
>>>> after this check, the processing is still delayed since
>>>> newidle_balance() only bails out for scenarios when a wakeup is trying
>>>> to queue a new task on the CPU running the newidle_balance().
>>>>
>>>>>
>>>>> Could there be cases where we want to do idle balance in this code path?
>>>>> Say a cpu is idle and a scheduling tick came in, we may try
>>>>> to look for something to run on the idle cpu. Seems like after
>>>>> your change above, that would be skipped.
>>>>
>>>> Wouldn't scheduler_tick() do load balancing when the time comes? In my
>>>> testing, I did not see a case where the workloads I tested were
>>>> sensitive to the aspect of newidle_balance() being invoked at scheduler
>>>> tick. Have you come across a workload which might be sensitive to this
>>>> aspect that I can quickly test and verify? Meanwhile, I'll run the
>>>> workloads mentioned in the commit log on an Intel system to see if I
>>>> can spot any sensitivity to this change.
>>>
>>> Instead of trying to fix spurious need_resched in the scheduler,
>>> can't we find a way to prevent it from happening ?
>>
>> The need_resched is not spurious. It is an effect of the optimization
>> introduced by commit b2a02fc43a1f ("smp: Optimize
>> send_call_function_single_ipi()") where, to pull a CPU out of
>> TIF_POLLING out of idle (and this happens for C0 (POLL) and C1 (MWAIT)
>> on the test machine), instead of sending an IPI for
>> smp_call_function_single(), the sender sets the TIF_NEED_RESCHED flag in
>> the idle task's thread info and in the path to "schedule_idle()", the
>> call to "flush_smp_call_function_queue()" processes the function call.
>
> I mean it's spurious in the sense that we don't need to resched but we
> need to pull the CPU out of the polling loop. At that time we don't
> know if there is a need to resched

Agreed.

>
>>
>> But since "TIF_NEED_RESCHED" was set to pull the CPU out of idle, the
>> scheduler now believes a new task exists which leads to the following
>> call stack:
>
> Exactly, TIF_NEED_RESCHED has been set so scheduler now believes it
> needs to look for a task. The solution is to not set TIF_NEED_RESCHED
> if you don't want the scheduler to look for a task including pulling
> it from another cpu
>
>>
>> do_idle()
>> schedule_idle()
>> __schedule(SM_NONE)
>> /* local_irq_disable() */
>> pick_next_task()
>> __pick_next_task()
>> pick_next_task_fair()
>> newidle_balance()
>> ... /* Still running with IRQs disabled */
>>
>> Since IRQs are disabled, the processing of IPIs are delayed leading
>> issue similar to the one outlined in commit 792b9f65a568 ("sched:
>> Allow newidle balancing to bail out of load_balance") when benchmarking
>> ipistorm.
>
> IMO it's not the same because commit 792b9f65a568 wants to abort early
> if something new happened

In case of commit 792b9f65a568, the newidle_balance() is cut short since
a pending IPI wants to queue a task which otherwise would have otherwise
led to a spike in tail latency. For ipistorm, there is a pending IPI
which is not queuing a task, but since the smp_call_function_single()
was invoked with wait=1, the sender will wait until the the the target
enables interrupts again and processes the call function which manifests
as a spike in tail latency.

>
>>
>>>
>>> Because of TIF_NEED_RESCHED being set when TIF_POLLING is set, idle
>>> load balances are already skipped for a less aggressive newly idle
>>> load balanced:
>>> https://lore.kernel.org/all/CAKfTPtC9Px_W84YRJqnFNkL8oofO15D-P=VTCMUUu7NJr+xwBA@xxxxxxxxxxxxxx/
>>
>> Are you referring to the "need_resched()" condition check in
>> "nohz_csd_func()"? Please correct me if I'm wrong.
>
> yes
>
>>
>> When I ran with sched-scoreboard
>> (https://github.com/AMDESE/sched-scoreboard/)with the patch on an idle
>> system for 60s I see the idle "load_balance count" go up in sched-stat
>
> If TIF_POLLING is not set, you will use normal IPI but otherwise, the
> wakeup for an idle load balance is skipped because need_resched is set
> and we have an newly idle load balance which you now want to skip too

I see! You are right.

>
>>
>> Following are the data for idle balance on SMT domain for each kernel:
>>
>> o tip:sched/core
>>
>> < ---------------------------------------- Category: idle ----------- >
>> load_balance count on cpu idle : 2678
>> load_balance found balanced on cpu idle : 2678
>> ->load_balance failed to find busier queue on cpu idle : 0
>> ->load_balance failed to find busier group on cpu idle : 2678
>> load_balance move task failed on cpu idle : 0
>> *load_balance success count on cpu idle : 0
>> imbalance sum on cpu idle : 0
>> pull_task count on cpu idle : 0
>> *avg task pulled per successfull lb attempt (cpu idle) : 0
>> ->pull_task when target task was cache-hot on cpu idle : 0
>> -------------------------------------------------------------------------
>>
>> o tip:sched/core + patch
>>
>> < ---------------------------------------- Category: idle ----------- >
>> load_balance count on cpu idle : 1895
>> load_balance found balanced on cpu idle : 1895
>> ->load_balance failed to find busier queue on cpu idle : 0
>> ->load_balance failed to find busier group on cpu idle : 1895
>> load_balance move task failed on cpu idle : 0
>> *load_balance success count on cpu idle : 0
>> imbalance sum on cpu idle : 0
>> pull_task count on cpu idle : 0
>> *avg task pulled per successfull lb attempt (cpu idle) : 0
>> ->pull_task when target task was cache-hot on cpu idle : 0
>> -------------------------------------------------------------------------
>>
>> Am I missing something? Since "load_balance count" is only updated when
>> "load_balance()" is called.
>>
>>>
>>> The root of the problem is that we keep TIF_NEED_RESCHED set
>>
>> We had prototyped a TIF_NEED_IPI flag to skip calls to schedule_idle()
>> on CPUs in TIF_POLLING when the idle CPU has to only process an IPI.
>> Although it solves the problem described in the commit log, it also
>> required enabling and testing it on multiple architectures.
>
> Yes, but that's the right solution IMO and it will prevent us to then
> try to catch the needless TIF_NEED_RESCHED

I'll discuss with Gautham on cleaning up the prototype and testing it
some more (we had only looked at ipistorm case) before sending out an
RFC.

> [..snip..]

--
Thanks and Regards,
Prateek