[RFC PATCH 6/7] sched/fair: Pull from a relatively busy group during newidle balance

From: Chen Yu
Date: Thu Jul 27 2023 - 02:39:59 EST


Scanning the whole sched domain to find the busiest group is
costly during newidle_balance() on a high core count system.

Introduce ILB_FAST to lower the bar for the busiest-group scan.
If the candidate sched group is relatively busy compared to the
local group, terminate the scan and try to pull from that group
directly.

Comparing ILB_UTIL and ILB_FAST: the former limits the sched
group scan when the system is busy, while the latter settles for
a reasonably busy group when the system is not busy. So they are
complementary to each other and work independently.

Tested on top of v6.5-rc2,
Sapphire Rapids with 2 x 56C/112T = 224 CPUs.
With cpufreq governor set to performance, and C6 disabled.

First, tested with an extreme synthetic test[1] borrowed from
Tianyou. It launches 224 processes, each looping on
nanosleep(1 us), which should trigger newidle balance
frequently:

i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;
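
For reference, a minimal sketch of such a stress loop (the real source
is linked as [1] below; treating the sleep interval in nanoseconds as
the only command-line argument is my assumption, based on the
invocation above):

/* Minimal nanosleep stress loop; the ns argument is assumed. */
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
	struct timespec ts = {
		.tv_sec  = 0,
		.tv_nsec = (argc > 1) ? atol(argv[1]) : 1000, /* default 1 us */
	};

	/* Each short sleep lets the CPU go idle and re-enter newidle balance. */
	for (;;)
		nanosleep(&ts, NULL);

	return 0;
}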

[ILB_SNAPSHOT + NO_ILB_UTIL + NO_ILB_FAST]
Check the /proc/schedstat delta on CPU8 over a 5 second window using
the script[2], by running: schedstat.py -i 5 -c 8

Mon Jul 24 23:43:43 2023 cpu8
.domain0.CPU_IDLE.lb_balanced 843
.domain0.CPU_IDLE.lb_count 843
.domain0.CPU_IDLE.lb_nobusyg 843
.domain0.CPU_IDLE.lb_sg_scan 843
.domain0.CPU_NEWLY_IDLE.lb_balanced 836
.domain0.CPU_NEWLY_IDLE.lb_count 837
.domain0.CPU_NEWLY_IDLE.lb_gained 1
.domain0.CPU_NEWLY_IDLE.lb_imbalance 1
.domain0.CPU_NEWLY_IDLE.lb_nobusyg 836
.domain0.CPU_NEWLY_IDLE.lb_sg_scan 837
.domain1.CPU_IDLE.lb_balanced 41
.domain1.CPU_IDLE.lb_count 41
.domain1.CPU_IDLE.lb_nobusyg 39
.domain1.CPU_IDLE.lb_sg_scan 2145
.domain1.CPU_NEWLY_IDLE.lb_balanced 732 <-----
.domain1.CPU_NEWLY_IDLE.lb_count 822 <-----
.domain1.CPU_NEWLY_IDLE.lb_failed 90
.domain1.CPU_NEWLY_IDLE.lb_imbalance 90
.domain1.CPU_NEWLY_IDLE.lb_nobusyg 497
.domain1.CPU_NEWLY_IDLE.lb_nobusyq 235
.domain1.CPU_NEWLY_IDLE.lb_sg_scan 45210 <-----
.domain1.ttwu_wake_remote 626
.domain2.CPU_IDLE.lb_balanced 15
.domain2.CPU_IDLE.lb_count 15
.domain2.CPU_NEWLY_IDLE.lb_balanced 635
.domain2.CPU_NEWLY_IDLE.lb_count 655
.domain2.CPU_NEWLY_IDLE.lb_failed 20
.domain2.CPU_NEWLY_IDLE.lb_imbalance 40
.domain2.CPU_NEWLY_IDLE.lb_nobusyg 633
.domain2.CPU_NEWLY_IDLE.lb_nobusyq 2
.domain2.CPU_NEWLY_IDLE.lb_sg_scan 655
.stats.rq_cpu_time 227910772
.stats.rq_sched_info.pcount 89393
.stats.rq_sched_info.run_delay 2145671
.stats.sched_count 178783
.stats.sched_goidle 89390
.stats.ttwu_count 89392
.stats.ttwu_local 88766

For domain1, there are 822 newidle balance attempts, and the
total number of groups scanned is 45210, so each balance scans
about 55 groups on average. Of these 822 attempts, 732 end up
(or already are) balanced, so the effective balance success
ratio is (822 - 732) / 822 = 10.94%.

The perf:
9.38% 0.45% [kernel.kallsyms] [k] newidle_balance
6.84% 5.32% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0

[ILB_SNAPSHOT + NO_ILB_UTIL + ILB_FAST]
Mon Jul 24 23:43:50 2023 cpu8
.domain0.CPU_IDLE.lb_balanced 918
.domain0.CPU_IDLE.lb_count 918
.domain0.CPU_IDLE.lb_nobusyg 918
.domain0.CPU_IDLE.lb_sg_scan 918
.domain0.CPU_NEWLY_IDLE.lb_balanced 1536
.domain0.CPU_NEWLY_IDLE.lb_count 1545
.domain0.CPU_NEWLY_IDLE.lb_failed 1
.domain0.CPU_NEWLY_IDLE.lb_gained 8
.domain0.CPU_NEWLY_IDLE.lb_imbalance 9
.domain0.CPU_NEWLY_IDLE.lb_nobusyg 1536
.domain0.CPU_NEWLY_IDLE.lb_sg_scan 1545
.domain1.CPU_IDLE.lb_balanced 45
.domain1.CPU_IDLE.lb_count 45
.domain1.CPU_IDLE.lb_nobusyg 43
.domain1.CPU_IDLE.lb_sg_scan 2365
.domain1.CPU_NEWLY_IDLE.lb_balanced 1196 <------
.domain1.CPU_NEWLY_IDLE.lb_count 1496 <------
.domain1.CPU_NEWLY_IDLE.lb_failed 296
.domain1.CPU_NEWLY_IDLE.lb_gained 4
.domain1.CPU_NEWLY_IDLE.lb_imbalance 301
.domain1.CPU_NEWLY_IDLE.lb_nobusyg 1182
.domain1.CPU_NEWLY_IDLE.lb_nobusyq 14
.domain1.CPU_NEWLY_IDLE.lb_sg_scan 30127 <------
.domain1.ttwu_wake_remote 2688
.domain2.CPU_IDLE.lb_balanced 13
.domain2.CPU_IDLE.lb_count 13
.domain2.CPU_NEWLY_IDLE.lb_balanced 898
.domain2.CPU_NEWLY_IDLE.lb_count 904
.domain2.CPU_NEWLY_IDLE.lb_failed 6
.domain2.CPU_NEWLY_IDLE.lb_imbalance 11
.domain2.CPU_NEWLY_IDLE.lb_nobusyg 896
.domain2.CPU_NEWLY_IDLE.lb_nobusyq 2
.domain2.CPU_NEWLY_IDLE.lb_sg_scan 904
.stats.rq_cpu_time 239830575
.stats.rq_sched_info.pcount 90879
.stats.rq_sched_info.run_delay 2436461
.stats.sched_count 181732
.stats.sched_goidle 90853
.stats.ttwu_count 90880
.stats.ttwu_local 88192

With ILB_FAST enabled, the CPU_NEWLY_IDLE lb_count in domain1 on
CPU8 is 1496, and the total number of groups scanned is 30127,
so each load balance scans about 20 groups, far fewer than the
56 groups in the domain. Of these 1496 attempts, 1196 are
balanced, so the effective balance success ratio is
(1496 - 1196) / 1496 = 20.05%, which is higher than the 10.94%
seen with ILB_FAST disabled.
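
For reference, the arithmetic above can be reproduced from the
schedstat deltas with a small userspace helper (a sketch only; the
struct and field names below just mirror the deltas printed above and
are not an existing interface):

#include <stdio.h>

/* Deltas of the domain1 CPU_NEWLY_IDLE counters over the sample window. */
struct newidle_delta {
	unsigned long lb_count;    /* balance attempts */
	unsigned long lb_balanced; /* attempts that found nothing to pull */
	unsigned long lb_sg_scan;  /* sched groups scanned in total */
};

static void report(const char *tag, const struct newidle_delta *d)
{
	printf("%s: %.1f groups/balance, %.2f%% effective success\n",
	       tag,
	       (double)d->lb_sg_scan / d->lb_count,
	       100.0 * (d->lb_count - d->lb_balanced) / d->lb_count);
}

int main(void)
{
	struct newidle_delta no_ilb_fast = {  822,  732, 45210 };
	struct newidle_delta ilb_fast    = { 1496, 1196, 30127 };

	report("NO_ILB_FAST", &no_ilb_fast); /* ~55 groups/balance, ~11% success */
	report("ILB_FAST",    &ilb_fast);    /* ~20 groups/balance, ~20% success */

	return 0;
}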

perf profile:

2.95% 0.38% [kernel.kallsyms] [k] newidle_balance
2.00% 1.51% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0

With ILB_FAST enabled, the time spent in update_sd_lb_stats() has dropped a lot.

More benchmark results are shown below.
The baseline is ILB_SNAPSHOT + NO_ILB_UTIL, compared against
ILB_SNAPSHOT + NO_ILB_UTIL + ILB_FAST.

[netperf]
Launches nr instances of:
netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &

nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR UDP_RR

throughput
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 1.83) +4.25 ( 5.15)
TCP_RR 112-threads 1.00 ( 2.19) +0.96 ( 2.84)
TCP_RR 168-threads 1.00 ( 1.92) -0.04 ( 2.11)
TCP_RR 224-threads 1.00 ( 1.98) -0.03 ( 1.76)
TCP_RR 280-threads 1.00 ( 63.11) -7.59 ( 62.46)
TCP_RR 336-threads 1.00 ( 18.44) -0.45 ( 19.81)
TCP_RR 392-threads 1.00 ( 26.49) -0.09 ( 30.85)
TCP_RR 448-threads 1.00 ( 40.47) -0.28 ( 39.71)
UDP_RR 56-threads 1.00 ( 1.83) -0.31 ( 2.08)
UDP_RR 112-threads 1.00 ( 13.77) +3.58 ( 3.22)
UDP_RR 168-threads 1.00 ( 10.97) -0.08 ( 11.77)
UDP_RR 224-threads 1.00 ( 12.83) -0.04 ( 14.03)
UDP_RR 280-threads 1.00 ( 13.89) +0.35 ( 16.83)
UDP_RR 336-threads 1.00 ( 24.91) +1.38 ( 22.57)
UDP_RR 392-threads 1.00 ( 34.86) -0.91 ( 33.89)
UDP_RR 448-threads 1.00 ( 40.63) +0.70 ( 44.18)

[tbench]
tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448

throughput
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 0.89) +1.51 ( 2.20)
loopback 112-threads 1.00 ( 0.03) +1.15 ( 0.29)
loopback 168-threads 1.00 ( 53.55) -37.92 ( 0.02)
loopback 224-threads 1.00 ( 61.24) -43.18 ( 0.01)
loopback 280-threads 1.00 ( 0.04) +0.33 ( 0.08)
loopback 336-threads 1.00 ( 0.35) +0.40 ( 0.17)
loopback 392-threads 1.00 ( 0.61) +0.49 ( 0.14)
loopback 448-threads 1.00 ( 0.08) +0.01 ( 0.24)

[schbench]
schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations

99.0th latency
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.56) -0.45 ( 0.32)
normal 2-mthreads 1.00 ( 0.95) +1.01 ( 3.45)
normal 4-mthreads 1.00 ( 4.04) -0.60 ( 1.26)

[hackbench]
hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28

job: 1, 2, 4, 8
work_type: process threads

throughput
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.20) +2.30 ( 0.26)
process-pipe 2-groups 1.00 ( 3.53) +6.14 ( 2.45)
process-pipe 4-groups 1.00 ( 1.07) -4.58 ( 2.58)
process-sockets 1-groups 1.00 ( 0.36) +0.75 ( 1.22)
process-sockets 2-groups 1.00 ( 0.84) +1.26 ( 1.11)
process-sockets 4-groups 1.00 ( 0.04) +0.97 ( 0.11)
threads-pipe 1-groups 1.00 ( 3.62) +3.22 ( 2.64)
threads-pipe 2-groups 1.00 ( 4.17) +5.85 ( 7.53)
threads-pipe 4-groups 1.00 ( 5.30) -4.14 ( 5.39)
threads-sockets 1-groups 1.00 ( 0.40) +3.50 ( 3.13)
threads-sockets 2-groups 1.00 ( 2.54) +1.79 ( 0.80)
threads-sockets 4-groups 1.00 ( 0.05) +1.33 ( 0.03)

Considering the std%, no significant score difference is noticed.
This suggests that ILB_FAST reduces the cost of newidle balance
without hurting performance.

Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/schedstat.py #2
Suggested-by: Tim Chen <tim.c.chen@xxxxxxxxx>
Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
---
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e360ed16e14..9af57b5a24dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10182,6 +10182,36 @@ static void update_ilb_group_scan(struct lb_env *env,
WRITE_ONCE(sd_share->nr_sg_scan, (int)nr_scan);
}

+static bool can_pull_busiest(struct sg_lb_stats *local,
+ struct sg_lb_stats *busiest)
+{
+ /*
+ * Check if the local group can pull from the 'busiest'
+ * group directly. When reaching here, update_sd_pick_busiest()
+ * has already filtered a candidate.
+ * The scan in newidle load balance on a high core count system
+ * is costly, so provide this shortcut to find a relatively busy
+ * group rather than the busiest one.
+ *
+ * Only enable this shortcut when the local group is quite
+ * idle. This is because the total cost of newidle_balance()
+ * becomes severe when multiple CPUs fall into idle and launch
+ * newidle_balance() concurrently. And that usually indicates
+ * a group_has_spare status.
+ */
+ if (local->group_type != group_has_spare)
+ return false;
+
+ if (busiest->idle_cpus > local->idle_cpus)
+ return false;
+
+ if (busiest->idle_cpus == local->idle_cpus &&
+ busiest->sum_nr_running <= local->sum_nr_running)
+ return false;
+
+ return true;
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -10226,6 +10256,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
sds->busiest_stat = *sgs;
+ /*
+ * Check if this busiest group can be pulled by the
+ * local group directly.
+ */
+ if (sched_feat(ILB_FAST) && ilb_can_load &&
+ can_pull_busiest(local, sgs))
+ goto load_snapshot;
}

if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 30f6d1a2f235..4d67e0abb78c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -104,3 +104,4 @@ SCHED_FEAT(BASE_SLICE, true)

SCHED_FEAT(ILB_SNAPSHOT, true)
SCHED_FEAT(ILB_UTIL, true)
+SCHED_FEAT(ILB_FAST, true)
--
2.25.1