Re: [PATCH 0/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level

From: Shrikanth Hegde
Date: Thu Mar 28 2024 - 01:48:39 EST




On 3/28/24 6:17 AM, Vitalii Bursov wrote:
> Hi,
>
> During the upgrade from Linux 5.4 we found a small (around 3%)
> performance regression which was tracked to commit

Do you see the regression because it is doing more newidle balancing?

> c5b0a7eefc70150caf23e37bc9d639c68c87a097
>
> sched/fair: Remove sysctl_sched_migration_cost condition
>
> With a default value of 500us, sysctl_sched_migration_cost is
> significanlty higher than the cost of load_balance. Remove the
> condition and rely on the sd->max_newidle_lb_cost to abort
> newidle_balance.
>
>
> Looks like "newidle" balancing is beneficial for a lot of workloads,
> just not for this specific one. The workload is video encoding, there
> are 100s-1000s of threads, some are synchonized with mutexes and

s/synchonized/synchronized/

> conditional variables. The process aims to have a portion of CPU idle,
> so no CPU cores are 100% busy. Perhaps, the performance impact we see
> comes from additional processing in the scheduler and additional cost
> like more cache misses, and not from an incorrect balancing. See
> perf output below.
>
> My understanding is that "sched_relax_domain_level" cgroup parameter
> should control if newidle_balance() is called and what's the scope

s/newidle_balance()/sched_balance_newidle()/ in all places, since the
function was renamed recently.

> of the balancing is, but it doesn't fully work for this case.
>
> cpusets.rst documentation:
>> The 'cpuset.sched_relax_domain_level' file allows you to request changing
>> this searching range as you like. This file takes int value which
>> indicates size of searching range in levels ideally as follows,
>> otherwise initial value -1 that indicates the cpuset has no request.
>>
>> ====== ===========================================================
>> -1 no request. use system default or follow request of others.
>> 0 no search.
>> 1 search siblings (hyperthreads in a core).
>> 2 search cores in a package.
>> 3 search cpus in a node [= system wide on non-NUMA system]
>> 4 search nodes in a chunk of node [on NUMA system]
>> 5 search system wide [on NUMA system]
>> ====== ===========================================================
>

I think this document needs to be updated. Levels need not be in serial
order because of sched domain degeneration. It should have a paragraph that
tells the user to look at /sys/kernel/debug/sched/domains/cpu*/domain*/ for
system-specific details.
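
For illustration, a minimal sketch of how a user could dump that
information. It assumes debugfs is mounted at /sys/kernel/debug, that each
domain directory exposes "name" and "flags" files with space-separated flag
names, and that it is run as root. The helper name read_sched_domains() is
made up for this example.

```python
from pathlib import Path


def read_sched_domains(cpu=0, root="/sys/kernel/debug/sched/domains"):
    """Return [(domain_dir, name, flags)] for one CPU.

    Returns an empty list when the directory is absent or unreadable.
    The "name" and "flags" file layout is an assumption of this sketch.
    """
    cpu_dir = Path(root) / f"cpu{cpu}"
    # Sort numerically so that domain10 comes after domain2.
    domain_dirs = sorted(
        cpu_dir.glob("domain*"),
        key=lambda p: int(p.name[len("domain"):]),
    )
    result = []
    for d in domain_dirs:
        name = (d / "name").read_text().strip()
        flags = (d / "flags").read_text().split()
        result.append((d.name, name, flags))
    return result


if __name__ == "__main__":
    for dom, name, flags in read_sched_domains(0):
        state = "on" if "SD_BALANCE_NEWIDLE" in flags else "off"
        print(f"{dom}: {name} newidle={state}")
```

Running it before and after changing cpuset.sched_relax_domain_level should
show which domains actually lose SD_BALANCE_NEWIDLE on a given machine.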

> Setting cpuset.sched_relax_domain_level to 0 works as 1.
>
> On a dual-CPU server, domains and levels are as follows:
> domain 0: level 0, SMT
> domain 1: level 2, MC
> domain 2: level 5, NUMA
>
> So, to support "0 no search", the value in
> cpuset.sched_relax_domain_level should disable SD_BALANCE_NEWIDLE for a
> specified level and keep it enabled for prior levels. For example, SMT
> level is 0, so sched_relax_domain_level=0 should exclude levels >=0.
>
> Instead, cpuset.sched_relax_domain_level enables the specified level,
> which effectively removes "no search" option. See below for domain
> flags for all cpuset.sched_relax_domain_level values.
>
> Proposed patch allows clearing SD_BALANCE_NEWIDLE flags when
> cpuset.sched_relax_domain_level is set to 0 and extends max
> value validation range beyond sched_domain_level_max. This allows
> setting SD_BALANCE_NEWIDLE on all levels and override platform
> default if it does not include all levels.
>
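
To check that I read the description correctly, here is a toy userspace
model (not kernel code). It assumes the current check clears
SD_BALANCE_NEWIDLE on domains whose level is greater than the request, and
that the proposed check clears it on domains whose level is greater than or
equal to the request. The function name and the "inclusive" parameter are
made up for illustration; the levels are the ones from the dual-CPU example
above.

```python
# Domain levels from the dual-CPU example: SMT=0, MC=2, NUMA=5.
DOMAINS = [("SMT", 0), ("MC", 2), ("NUMA", 5)]


def newidle_domains(request, inclusive, domains=DOMAINS):
    """Return names of domains that keep SD_BALANCE_NEWIDLE.

    request < 0 models "no request"; this sketch then clears nothing
    (it ignores any system-wide default).
    inclusive=False: clear where level >  request (assumed current check)
    inclusive=True:  clear where level >= request (assumed proposed check)
    """
    if request < 0:
        return [name for name, _ in domains]
    if inclusive:
        return [name for name, level in domains if level < request]
    return [name for name, level in domains if level <= request]


if __name__ == "__main__":
    print("request  current  proposed")
    for request in range(-1, 7):
        print(request,
              newidle_domains(request, inclusive=False),
              newidle_domains(request, inclusive=True))
```

In this model, a request of 0 and a request of 1 both leave only SMT enabled
with the exclusive check, which matches "0 works as 1". With the inclusive
check, 0 disables everything, and a value one past the highest level (6 here)
is needed to keep NUMA enabled, which is why the validation range would have
to be extended.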