[RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance

From: Chen Yu
Date: Thu Jul 27 2023 - 02:39:40 EST


Scanning the whole sched domain to find the busiest group is time-consuming
during newidle_balance(). When a CPU becomes idle, it should pull some
tasks from other CPUs as quickly as possible.

Limit the scan depth of newidle_balance() so that it scans only a limited
number of sched groups to find a relatively busy group, and pulls from it.
In short, the more spare capacity there is in the domain, the more time
each newidle balance may spend scanning for a busy group. Although
newidle balance already has a per-domain max_newidle_lb_cost to decide
whether to launch the balance at all, ILB_UTIL provides a finer
granularity: it decides how many groups each newidle balance may scan.

The scan depth is calculated by the previous periodic load balance,
based on the domain's overall utilization.
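
As a rough standalone sketch of that idea (illustrative only; the names
and the exact formula below are not taken from this series, which
computes nr_sg_scan elsewhere):

#include <stdio.h>

/*
 * Illustrative linear mapping from domain utilization to scan depth:
 * the more spare capacity, the more sched groups the next newidle
 * balance may scan. At least one group is always scanned.
 */
static int ilb_scan_depth(unsigned long sum_util,
			  unsigned long total_capacity,
			  int nr_groups)
{
	unsigned long free_cap;
	int depth;

	if (!total_capacity)
		return nr_groups;	/* no snapshot yet: scan everything */

	if (sum_util > total_capacity)
		sum_util = total_capacity;

	free_cap = total_capacity - sum_util;
	depth = (int)((nr_groups * free_cap) / total_capacity);

	return depth > 0 ? depth : 1;
}

int main(void)
{
	/* e.g. 8 groups, domain ~50% utilized -> scan 4 groups */
	printf("%d\n", ilb_scan_depth(512, 1024, 8));
	return 0;
}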

Tested on top of v6.5-rc2 on a Sapphire Rapids system with
2 x 56C/112T = 224 CPUs, with the cpufreq governor set to performance
and C6 disabled.

First, an extreme synthetic test[1] was run, which launches 224
processes. Each process loops on nanosleep(1 us), which should trigger
newidle balance as often as possible:

i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;

NO_ILB_UTIL + ILB_SNAPSHOT:
9.38% 0.45% [kernel.kallsyms] [k] newidle_balance
6.84% 5.32% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0

ILB_UTIL + ILB_SNAPSHOT:
3.35% 0.38% [kernel.kallsyms] [k] newidle_balance
2.30% 1.81% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0

With ILB_UTIL enabled, the total number of newidle_balance() and
update_sd_lb_stats() invocations drops, although the reason why there
are fewer newidle balances has not been investigated. Judging by the low
util_avg values in /sys/kernel/debug/sched/debug, there should be little
impact on the nanosleep stress test.

Tests over a wider range of benchmarks:

[netperf]
Launches nr instances of:
netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &

nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR, UDP_RR

throughput
==========
case             load           baseline(std%)  compare%( std%)
TCP_RR           56-threads     1.00 (  5.15)    -3.96 (  2.17)
TCP_RR           112-threads    1.00 (  2.84)    -0.82 (  2.24)
TCP_RR           168-threads    1.00 (  2.11)    -0.03 (  2.31)
TCP_RR           224-threads    1.00 (  1.76)    +0.01 (  2.12)
TCP_RR           280-threads    1.00 ( 62.46)   +56.56 ( 56.91)
TCP_RR           336-threads    1.00 ( 19.81)    +0.27 ( 17.90)
TCP_RR           392-threads    1.00 ( 30.85)    +0.13 ( 29.09)
TCP_RR           448-threads    1.00 ( 39.71)   -18.82 ( 45.93)
UDP_RR           56-threads     1.00 (  2.08)    -0.31 (  7.89)
UDP_RR           112-threads    1.00 (  3.22)    -0.50 ( 15.19)
UDP_RR           168-threads    1.00 ( 11.77)    +0.37 ( 10.30)
UDP_RR           224-threads    1.00 ( 14.03)    +0.25 ( 12.88)
UDP_RR           280-threads    1.00 ( 16.83)    -0.57 ( 15.34)
UDP_RR           336-threads    1.00 ( 22.57)    +0.01 ( 24.68)
UDP_RR           392-threads    1.00 ( 33.89)    +2.65 ( 33.89)
UDP_RR           448-threads    1.00 ( 44.18)    +0.81 ( 41.28)

Considering the std%, there is not much difference for netperf.

[tbench]
tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448

throughput
==========
case             load           baseline(std%)  compare%( std%)
loopback         56-threads     1.00 (  2.20)    -0.09 (  2.05)
loopback         112-threads    1.00 (  0.29)    -0.88 (  0.10)
loopback         168-threads    1.00 (  0.02)   +62.92 ( 54.57)
loopback         224-threads    1.00 (  0.05)  +234.30 (  1.81)
loopback         280-threads    1.00 (  0.08)    -0.11 (  0.21)
loopback         336-threads    1.00 (  0.17)    -0.17 (  0.08)
loopback         392-threads    1.00 (  0.14)    -0.09 (  0.18)
loopback         448-threads    1.00 (  0.24)    -0.53 (  0.55)

There is an improvement for tbench in the 224-threads case.

[hackbench]

hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28

job: 1, 2, 4, 8
work_type: process, threads

throughput
==========
case             load           baseline(std%)  compare%( std%)
process-pipe     1-groups       1.00 (  0.20)    +1.57 (  0.58)
process-pipe     2-groups       1.00 (  3.53)    +2.99 (  2.03)
process-pipe     4-groups       1.00 (  1.07)    +0.17 (  1.64)
process-sockets  1-groups       1.00 (  0.36)    -0.04 (  1.44)
process-sockets  2-groups       1.00 (  0.84)    +0.65 (  1.65)
process-sockets  4-groups       1.00 (  0.04)    +0.89 (  0.08)
threads-pipe     1-groups       1.00 (  3.62)    -0.53 (  1.67)
threads-pipe     2-groups       1.00 (  4.17)    -4.79 (  0.53)
threads-pipe     4-groups       1.00 (  5.30)    +5.06 (  1.95)
threads-sockets  1-groups       1.00 (  0.40)    +1.44 (  0.53)
threads-sockets  2-groups       1.00 (  2.54)    +2.21 (  2.51)
threads-sockets  4-groups       1.00 (  0.05)    +1.29 (  0.05)

Not much difference for hackbench.

[schbench(old)]
schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations

99.0th latency
==============
case             load           baseline(std%)  compare%( std%)
normal           1-mthreads     1.00 (  0.56)    -0.91 (  0.32)
normal           2-mthreads     1.00 (  0.95)    -4.05 (  3.63)
normal           4-mthreads     1.00 (  4.04)    -0.30 (  2.35)

Not much difference for schbench.

[Limitation]
In the previous version, Prateek reported a regression. That could be
due to the concurrent access across the Numa node, or ILB_UTIL did not
scan hard enough to pull from the busiest group. The former issue is
fixed by not enabling ILB_UTIL for Numa domain. If there is still
regression in this version, we can leverage the result of SIS_UTIL,
to provide a quadratic function rather than the linear function, to
scan harder when the system is idle.
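
For reference, a quadratic mapping in that spirit could look like the
sketch below (illustrative only, not the actual SIS_UTIL formula); it is
a drop-in alternative to the linear ilb_scan_depth() sketch above and
allows a deeper scan for the same utilization:

/*
 * Illustrative quadratic mapping: depth ~ nr_groups * (1 - (util/cap)^2),
 * which scans harder than the linear mapping when spare capacity exists.
 * Assumes 64-bit unsigned long so the squared terms do not overflow.
 */
static int ilb_scan_depth_quadratic(unsigned long sum_util,
				    unsigned long total_capacity,
				    int nr_groups)
{
	unsigned long util;
	int depth;

	if (!total_capacity)
		return nr_groups;

	util = sum_util > total_capacity ? total_capacity : sum_util;

	depth = nr_groups - (int)((nr_groups * util * util) /
				  (total_capacity * total_capacity));

	return depth > 0 ? depth : 1;
}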

Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
Suggested-by: Tim Chen <tim.c.chen@xxxxxxxxx>
Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
---
kernel/sched/fair.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6925813db59b..4e360ed16e14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10195,7 +10195,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
unsigned long sum_util = 0;
- int sg_status = 0;
+ int sg_status = 0, nr_sg_scan;
+ /* only newidle CPU can load the snapshot */
+ bool ilb_can_load = env->idle == CPU_NEWLY_IDLE &&
+ sd_share && READ_ONCE(sd_share->total_capacity);
+
+ if (sched_feat(ILB_UTIL) && ilb_can_load)
+ nr_sg_scan = sd_share->nr_sg_scan;

do {
struct sg_lb_stats *sgs = &tmp_sgs;
@@ -10222,6 +10228,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sds->busiest_stat = *sgs;
}

+ if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
+ goto load_snapshot;
+
next_group:
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
@@ -10231,6 +10240,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sg = sg->next;
} while (sg != env->sd->groups);

+ ilb_can_load = false;
+
+load_snapshot:
+ if (ilb_can_load) {
+ /* borrow the statistic of previous periodic load balance */
+ sds->total_load = READ_ONCE(sd_share->total_load);
+ sds->total_capacity = READ_ONCE(sd_share->total_capacity);
+ }
+
/*
* Indicate that the child domain of the busiest group prefers tasks
* go to a child's sibling domains first. NB the flags of a sched group
--
2.25.1