Re: [RFC PATCH 00/14] Introducing TIF_NOTIFY_IPI flag

From: K Prateek Nayak
Date: Fri Mar 15 2024 - 02:38:35 EST


(Trimming the cc list to only include scheduler folks)

Hello Julia,

On 3/8/2024 1:26 AM, Julia Lawall wrote:
>
>
> On Wed, 6 Mar 2024, Vincent Guittot wrote:
>
>> Hi Prateek,
>>
>> Adding Julia who could be interested in this patchset. Your patchset
>> should now trigger idle load balance instead of newly idle load
>> balance when polling is used. This was one reason for not migrating a
>> task to an idle CPU
>
> My situation is roughly as follows:
>
> The machine is an Intel 6130 with two sockets and 32 hardware threads
> (subsequently referred to as cores) per socket. The test is bt.B of the
> OpenMP version of the NAS benchmark suite. Initially there is one
> thread per core. NUMA balancing occurs, resulting in a move, and thus 31
> threads on one socket and 33 on the other.
>
> Load balancing should result in the idle core pulling one of the threads
> from the other socket. But that doesn't happen in normal load balancing,
> because all 33 threads on the overloaded socket are considered to have a
> preference for that socket. Active balancing could pull a thread, but it
> is not triggered because the idle core is seen as being newly idle.
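
The reason a "newly idle" balance can never escalate to active
balancing is visible in load_balance() (kernel/sched/fair.c, trimmed
for brevity, not verbatim):

	if (!ld_moved) {
		schedstat_inc(sd->lb_failed[idle]);
		/*
		 * Increment the failure counter only on periodic balance.
		 * We do not want newidle balance, which can be very
		 * frequent, pollute the failure counter causing
		 * excessive cache_hot migrations and active balances.
		 */
		if (idle != CPU_NEWLY_IDLE)
			sd->nr_balance_failed++;
		...
	}

need_active_balance() only returns true (barring special cases such as
asym packing) once nr_balance_failed has grown past
sd->cache_nice_tries + 2, so a balance that is always classified as
newly idle never accumulates enough failures to force a migration.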
>
> The question is then why a core that has been idle for up to multiple
> seconds is continually seen as newly idle. Every 4ms, a scheduler tick
> submits some work to try to load balance. This submission process
> previously broke out of the idle loop due to a need_resched, which is
> the same issue addressed by this patch series. The need_resched caused
> an invocation of schedule, which would then see that there was no task
> to pick, causing the core to be considered newly idle. The
> classification as newly idle doesn't take into account whether any
> task was running prior to the call to schedule.
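
For context, that path is roughly the following (a condensed sketch of
kernel/sched/idle.c and kernel/sched/fair.c, not verbatim upstream
code):

	static void do_idle(void)
	{
		while (!need_resched())	/* the only exit from the loop */
			cpuidle_idle_call();

		/* need_resched was set, e.g. by the tick's balance kick */
		schedule_idle();	/* -> __schedule() -> pick_next_task() */
	}

	/*
	 * pick_next_task_fair() then finds no runnable task and balances
	 * as "newly idle", no matter how long the CPU was idle before:
	 */
	if (!sched_fair_runnable(rq))
		newidle_balance(rq, rf);	/* idle type = CPU_NEWLY_IDLE */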
>
> The load balancing work that was submitted every 4ms is also a NOP due
> to a test for need_resched.
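
That test is, I believe, the one in nohz_csd_func()
(kernel/sched/core.c, trimmed):

	static void nohz_csd_func(void *info)
	{
		struct rq *rq = info;
		int cpu = cpu_of(rq);
		...
		rq->idle_balance = idle_cpu(cpu);
		if (rq->idle_balance && !need_resched()) {
			rq->nohz_idle_balance = flags;
			raise_softirq_irqoff(SCHED_SOFTIRQ);
		}
		/* with need_resched set, the kick is silently dropped */
	}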
>
> This patch series makes need_resched no longer the only way out of the
> idle loop. Without the need_resched, the load balancing work that is
> submitted every 4ms can actually try to do load balancing. The core is
> not newly idle, so active balancing could in principle occur. But now
> nothing happens because the work is run by ksoftirqd. The presence of
> ksoftirqd on the idle core means that the core is no longer idle. Thus
> there is no more need for load balancing.
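
This follows from idle_cpu() looking only at the instantaneous state of
the runqueue (kernel/sched/core.c, trimmed):

	int idle_cpu(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		if (rq->curr != rq->idle)	/* ksoftirqd running => not idle */
			return 0;

		if (rq->nr_running)
			return 0;

		if (rq->ttwu_pending)
			return 0;

		return 1;
	}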

Thinking slightly ahead, assuming that the idle balancer realizes that
ksoftirqd is running for load balancing itself and discounts it from
consideration, won't the NUMA_IMBALANCE_MIN considered by
adjust_numa_imbalance() continue to keep the 33-31 distribution?

In both task_numa_find_cpu() [1] and calculate_imbalance() [2], even
though the scheduler classifies the local group as "group_has_spare",
a migration across the NUMA domains is still restricted, I believe, as
long as the imbalance is <= NUMA_IMBALANCE_MIN, which is 2.
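
For reference, trimmed from kernel/sched/fair.c as of v6.8:

	#define NUMA_IMBALANCE_MIN 2

	static inline long
	adjust_numa_imbalance(int imbalance, int dst_running, int imb_numa_nr)
	{
		if (dst_running > imb_numa_nr)
			return imbalance;

		/*
		 * Allow a small imbalance based on a simple pair of
		 * communicating tasks that remain local when the source
		 * domain is almost idle.
		 */
		if (imbalance <= NUMA_IMBALANCE_MIN)
			return 0;

		return imbalance;
	}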

Can you try setting NUMA_IMBALANCE_MIN to 0 and checking whether the
situation changes with the upstream kernel? I'm hoping the newidle
balance that is triggered on the way to idle (without this series) is
good enough to pull the task towards itself.
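
i.e., something along the lines of the following (untested) change,
just to see whether the 33-31 split breaks up:

	--- a/kernel/sched/fair.c
	+++ b/kernel/sched/fair.c
	@@ ... @@
	-#define NUMA_IMBALANCE_MIN 2
	+#define NUMA_IMBALANCE_MIN 0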

Please ignore if you've already tried this. I might have missed it
when going through the original thread.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/fair.c?h=v6.8&id=e8f897f4afef0031fe618a8e94127a0934896aba#n2368
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/fair.c?h=v6.8&id=e8f897f4afef0031fe618a8e94127a0934896aba#n10743

>
> [snip..]
>

--
Thanks and Regards,
Prateek