Re: [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance

From: Shrikanth Hegde
Date: Fri Jul 14 2023 - 09:04:41 EST




On 7/8/23 4:27 AM, Tim Chen wrote:
> From: Ricardo Neri <ricardo.neri-calderon@xxxxxxxxxxxxxxx>
>
> should_we_balance() traverses the group_balance_mask (AND'ed with lb_env::
> cpus) starting from lower numbered CPUs looking for the first idle CPU.
>
> In hybrid x86 systems, the siblings of SMT cores get CPU numbers, before
> non-SMT cores:
>
> [0, 1] [2, 3] [4, 5] 6 7 8 9
> b i b i b i b i i i
>
> In the figure above, CPUs in brackets are siblings of an SMT core. The
> rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
> idle CPU.
>
> We should let a CPU on a fully idle core get the first chance to idle
> load balance as it has more CPU capacity than a CPU on an idle SMT
> CPU with busy sibling. So for the figure above, if we are running
> should_we_balance() to CPU 1, we should return false to let CPU 7 on
> idle core to have a chance first to idle load balance.
>
> A partially busy (i.e., of type group_has_spare) local group with SMT 
> cores will often have only one SMT sibling busy. If the destination CPU
> is a non-SMT core, partially busy, lower-numbered, SMT cores should not
> be considered when finding the first idle CPU. 
>
> However, in should_we_balance(), when we encounter idle SMT first in partially
> busy core, we prematurely break the search for the first idle CPU.
>
> Higher-numbered, non-SMT cores is not given the chance to have
> idle balance done on their behalf. Those CPUs will only be considered
> for idle balancing by chance via CPU_NEWLY_IDLE.
>
> Instead, consider the idle state of the whole SMT core.
>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@xxxxxxxxxxxxxxx>
> Co-developed-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f491b94908bf..294a662c9410 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10729,7 +10729,7 @@ static int active_load_balance_cpu_stop(void *data);
> static int should_we_balance(struct lb_env *env)
> {
> struct sched_group *sg = env->sd->groups;
> - int cpu;
> + int cpu, idle_smt = -1;
>
> /*
> * Ensure the balancing environment is consistent; can happen
> @@ -10756,10 +10756,24 @@ static int should_we_balance(struct lb_env *env)
> if (!idle_cpu(cpu))
> continue;
>
> + /*
> + * Don't balance to idle SMT in busy core right away when
> + * balancing cores, but remember the first idle SMT CPU for
> + * later consideration. Find CPU on an idle core first.
> + */
> + if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
> + if (idle_smt == -1)
> + idle_smt = cpu;
> + continue;
> + }
> +
> /* Are we the first idle CPU? */
> return cpu == env->dst_cpu;
> }
>
> + if (idle_smt == env->dst_cpu)
> + return true;


This is nice. It helps in reducing the migrations and improve the performance
of CPU oriented benchmarks slightly. This could be due to less migrations.

Tested a bit on power10 with SMT=4. Offlined a CPU to make a few cores SMT=1. There is
no regression observed. Slight improvement in throughput oriented workload such as stress-ng.
migrations are reduced by quite a bit, likely due to patch. I have attached the test results there.
[4/6] sched/fair: Consider the idle state of the whole core for load balance

Test results:

# lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-17,24,25,32-49,56-89
Off-line CPU(s) list: 18-23,26-31,50-55,90-95
Model name: POWER10 (architected), altivec supported

--------------------------------------------------------------------------------------------------
baseline:

Performance counter stats for 'stress-ng --cpu=72 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

260,813.13 msec task-clock # 33.390 CPUs utilized ( +- 0.10% )
42,535 context-switches # 163.543 /sec ( +- 0.13% )
9,060 cpu-migrations # 34.835 /sec ( +- 1.07% )
12,947 page-faults # 49.780 /sec ( +- 1.76% )
948,061,954,432 cycles # 3.645 GHz ( +- 0.09% )
926,045,701,578 instructions # 0.98 insn per cycle ( +- 0.00% )
146,418,075,496 branches # 562.964 M/sec ( +- 0.00% )
1,197,661,965 branch-misses # 0.82% of all branches ( +- 0.17% )

7.8111 +- 0.0162 seconds time elapsed ( +- 0.21% )

Performance counter stats for 'stress-ng --cpu=60 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

253,351.70 msec task-clock # 28.207 CPUs utilized ( +- 0.21% )
41,046 context-switches # 162.828 /sec ( +- 0.16% )
6,674 cpu-migrations # 26.475 /sec ( +- 3.42% )
10,879 page-faults # 43.157 /sec ( +- 1.68% )
931,014,218,983 cycles # 3.693 GHz ( +- 0.22% )
919,717,564,454 instructions # 0.99 insn per cycle ( +- 0.00% )
145,480,596,331 branches # 577.116 M/sec ( +- 0.00% )
1,175,362,979 branch-misses # 0.81% of all branches ( +- 0.12% )

8.9818 +- 0.0288 seconds time elapsed ( +- 0.32% )


---------------------------------------------------------------------------------------------------
with patch:

Performance counter stats for 'stress-ng --cpu=72 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

254,652.01 msec task-clock # 33.449 CPUs utilized ( +- 0.11% )
40,970 context-switches # 160.974 /sec ( +- 0.10% )
5,397 cpu-migrations # 21.205 /sec ( +- 2.01% )
11,705 page-faults # 45.990 /sec ( +- 1.21% )
911,115,537,080 cycles # 3.580 GHz ( +- 0.11% )
925,635,958,489 instructions # 1.02 insn per cycle ( +- 0.00% )
146,450,995,164 branches # 575.416 M/sec ( +- 0.00% )
1,188,906,011 branch-misses # 0.81% of all branches ( +- 0.28% )

7.6132 +- 0.0381 seconds time elapsed ( +- 0.50% )

Performance counter stats for 'stress-ng --cpu=60 -l 50 --cpu-ops=100000 --cpu-load-slice=1' (5 runs):

236,962.38 msec task-clock # 27.948 CPUs utilized ( +- 0.05% )
40,030 context-switches # 168.869 /sec ( +- 0.04% )
3,156 cpu-migrations # 13.314 /sec ( +- 1.37% )
9,448 page-faults # 39.857 /sec ( +- 1.72% )
856,444,937,794 cycles # 3.613 GHz ( +- 0.06% )
919,459,795,805 instructions # 1.07 insn per cycle ( +- 0.00% )
145,654,799,033 branches # 614.452 M/sec ( +- 0.00% )
1,177,464,719 branch-misses # 0.81% of all branches ( +- 0.23% )

8.4788 +- 0.0323 seconds time elapsed ( +- 0.38% )
--------------------------------------------------------------------------------------------------------

Tried on a symmetric system with all cores having SMT=4 as well. There was reduction in migrations here as well.
Didnt observe any major regressions when microbenchmarks run alone. Such as hackbench, stress-ng.

So. Here is tested-by.
Tested-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx>



> +
> /* Are we the first CPU of this group ? */
> return group_balance_cpu(sg) == env->dst_cpu;
> }

One doubt though, Here a fully idle core would be chosen instead of first idle cpu in the
group (if there is one). Since coming out of idle of SMT is faster compared to a fully idle core,
would latency increase? Or that concerns mainly wakeup path?