Re: [RFC PATCH] sched/fair: Skip idle CPU search on busy system

From: Shrikanth Hegde
Date: Thu Jul 27 2023 - 11:06:02 EST




On 7/27/23 12:55 PM, Chen Yu wrote:
> On 2023-07-26 at 15:06:12 +0530, Shrikanth Hegde wrote:
>> When the system is fully busy, there will not be any idle CPUs.
>> In that case, load_balance will be called mainly with the CPU_NOT_IDLE
>> type. should_we_balance currently checks whether an idle CPU exists.
>> When the system is 100% busy, there will not be an idle CPU, so
>> these idle_cpu checks can be skipped. This avoids fetching those rq
>> structures.
>>
>
> Yes, I guess this could help reduce the cost if the sched group
> has many CPUs.

Thank you for the review, Chen Yu.
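
To make the saving concrete, here is a rough userspace model of the busy-case
check. It is only a sketch: every model_* name is hypothetical, the counter is
assumed to track idle CPUs exactly (nohz.nr_cpus in the kernel is only
maintained with CONFIG_NO_HZ_COMMON), and the group is simplified to a
contiguous CPU range whose first CPU plays the role of group_balance_cpu(sg).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8

/* Hypothetical stand-ins for kernel state; none of these are kernel APIs. */
static bool model_cpu_idle[MODEL_NR_CPUS];  /* stands in for idle_cpu(cpu) */
static atomic_int model_nr_idle;            /* stands in for nohz.nr_cpus */
static int model_rq_lookups;                /* counts per-CPU idle probes */

/* Mark a CPU idle or busy and keep the global idle counter in sync. */
static void model_set_idle(int cpu, bool idle)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (model_cpu_idle[cpu] == idle)
		return;
	model_cpu_idle[cpu] = idle;
	atomic_fetch_add(&model_nr_idle, idle ? 1 : -1);
}

/*
 * Simplified should_we_balance() for the CPU_NOT_IDLE case only.
 * The group spans CPUs [first, first + span).
 */
static bool model_should_we_balance(int dst_cpu, int first, int span)
{
	/* Fast path from the patch: no idle CPU anywhere, skip the scan. */
	if (atomic_load(&model_nr_idle) == 0)
		return first == dst_cpu;

	/* Slow path: the first idle CPU in the group does the balancing. */
	for (int cpu = first; cpu < first + span; cpu++) {
		model_rq_lookups++;
		if (model_cpu_idle[cpu])
			return cpu == dst_cpu;
	}

	/* No idle CPU in this group: fall back to the group's first CPU. */
	return first == dst_cpu;
}
```

With all CPUs busy, model_should_we_balance() returns without touching any
per-CPU state, so model_rq_lookups stays at 0. Once one CPU is marked idle,
the scan runs and probes CPUs until it reaches that one.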

>
>> This is a minor optimization for a specific case of 100% utilization.
>>
>> .....
>> Coming to the current implementation: it is a very basic approach to the
>> issue, and it may not be the best way to do this. It works only with
>> CONFIG_NO_HZ_COMMON. nohz.nr_cpus is globally available info which
>> tracks idle CPUs. AFAIU there isn't any other. If there is such info, we
>> can use that instead. nohz.nr_cpus is atomic, which might be costly too.
>>
>> An alternative would be to add a new attribute to sched_domain and update
>> it in the CPU idle entry/exit path per CPU. The advantage is that the check
>> can be per env->sd instead of global. Slightly complicated, but maybe better.
>> There could be other advantages at wakeup, such as limiting the scan.
>>
>
> When checking the code, I found that there is a per-domain nr_busy_cpus.
> However, that variable is only for the LLC domain. Maybe extending sd_share
> to the domains under NUMA is applicable, IMO.

True, I did see that. Doing this at every level when there are a large number
of CPUs will likely need a lock when updating sd_share, and that would
become a bottleneck as well. Since sd_share never makes sense for NUMA,
this would also lead to different checks for NUMA and non-NUMA domains, even
though the main benefit for this corner case would be at the NUMA level, where
the number of CPUs is large.

I will keep that in mind and try to work something out along those lines.
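
As a rough illustration of the per-domain idea, here is a userspace sketch. It
is not kernel code: struct model_domain and the model_* helpers are
hypothetical, and I have used relaxed atomics instead of a lock purely to keep
the example short. The shape of the update is the point.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one sched_domain level; not the kernel structure. */
struct model_domain {
	struct model_domain *parent;  /* next wider level, NULL at the top */
	atomic_int nr_idle;           /* idle CPUs spanned by this domain */
};

/* Called on idle entry (delta = +1) or idle exit (delta = -1) of one CPU. */
static void model_idle_update(struct model_domain *lowest, int delta)
{
	assert(delta == 1 || delta == -1);
	/*
	 * Every level up to the root is touched. The widest levels are
	 * shared by all CPUs, which is where contention would show up.
	 */
	for (struct model_domain *sd = lowest; sd != NULL; sd = sd->parent)
		atomic_fetch_add_explicit(&sd->nr_idle, delta,
					  memory_order_relaxed);
}

/* Per-domain version of the "fully busy" test. */
static bool model_domain_fully_busy(struct model_domain *sd)
{
	return atomic_load_explicit(&sd->nr_idle, memory_order_relaxed) == 0;
}
```

Every idle transition writes to the counter of each enclosing level, so on a
large machine the top-level counter is written by all CPUs. That is the
bottleneck I am worried about, whether the update is done with a lock or an
atomic.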

>
> thanks,
> Chenyu
>
>> Your feedback would really help. Does this optimization make sense?
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx>
>> ---
>> kernel/sched/fair.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 373ff5f55884..903d59b5290c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
>>  		return 1;
>>  	}
>>
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +	/* If the system is fully busy, it's better to skip the idle checks */
>> +	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
>> +		return group_balance_cpu(sg) == env->dst_cpu;
>> +#endif
>> +
>>  	/* Try to find first idle CPU */
>>  	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
>>  		if (!idle_cpu(cpu))
>> --
>> 2.31.1
>>