[PATCH v2] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

From: K Prateek Nayak
Date: Wed Feb 09 2022 - 05:13:03 EST


Currently, sched/tip allows a generous amount of NUMA imbalance at
task wakeup. For processes affined to a subset of cpus in a NUMA
group, the current behavior can lead to processes piling up on the
allowed cpus of one NUMA group even when there is an opportunity to
run them on a cpu of a different NUMA group.
Mel Gorman's "NUMA balance across LLCs" v6 patchset [1] improves task
placement on systems with smaller LLCs. However, there is still an
opportunity to balance more aggressively when processes are affined
to a subset of the group's cpus.
Good initial placement makes a difference especially for short lived
tasks, where the delay before the load balancer kicks in can degrade
performance.

The benchmark is a pinned run of STREAM, parallelized with OMP, on a
Zen3 machine. STREAM is configured to run 8 threads with CPU affinity
set to cpus 0,16,32,48,64,80,96,112.
This ensures an even distribution of allowed cpus across the NUMA groups
in NPS1, NPS2 and NPS4 modes.
The script running STREAM itself is pinned to cpus 8-15 to maintain
consistency across runs and to make sure the script runs on an LLC
that is not part of the STREAM cpulist, so it does not interfere with
the benchmark.
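For reference, a hedged sketch of how such a pinned run might be
launched; the STREAM binary name and the use of GOMP_CPU_AFFINITY are
assumptions for illustration, not taken from this posting:

```shell
# Pin the runner shell itself to cpus 8-15 (an LLC outside the
# benchmark cpulist), then spread 8 OMP threads across one cpu
# per NUMA group: 0,16,32,48,64,80,96,112.
taskset -c 8-15 env \
    OMP_NUM_THREADS=8 \
    GOMP_CPU_AFFINITY="0 16 32 48 64 80 96 112" \
    ./stream
```

The runner's affinity (taskset) is separate from the worker threads'
affinity (GOMP_CPU_AFFINITY), which is what keeps the script's own
work off the benchmark cpus.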

Changes are based on top of v6 of Mel's patchset
"Adjust NUMA imbalance for multiple LLCs" [1]

Following are the results:

            5.17.0-rc1           5.17.0-rc1           5.17.0-rc1
          tip sched/core       tip sched/core       tip sched/core
                                  + mel-v6             + mel-v6
                                                      + this-fix

NPS Mode - NPS1

Copy: 93918.06 (0.00 pct) 109462.74 (16.55 pct) 159529.28 (69.86 pct)
Scale: 93587.15 (0.00 pct) 107532.66 (14.90 pct) 149857.47 (60.12 pct)
Add: 109049.25 (0.00 pct) 125020.15 (14.64 pct) 187370.68 (71.82 pct)
Triad: 110837.20 (0.00 pct) 120235.47 (8.47 pct) 184970.30 (66.88 pct)

NPS Mode - NPS2

Copy: 72897.93 (0.00 pct) 67735.80 (-7.08 pct) 158353.23 (117.22 pct)
Scale: 67053.02 (0.00 pct) 63938.28 (-4.64 pct) 151405.05 (125.79 pct)
Add: 82369.06 (0.00 pct) 79950.10 (-2.93 pct) 195779.90 (137.68 pct)
Triad: 86169.70 (0.00 pct) 83096.03 (-3.56 pct) 192829.44 (123.77 pct)

NPS Mode - NPS4

Copy: 47215.25 (0.00 pct) 76329.25 (61.66 pct) 166127.12 (251.85 pct)
Scale: 44749.85 (0.00 pct) 68847.86 (53.85 pct) 157443.02 (251.82 pct)
Add: 56184.15 (0.00 pct) 92570.93 (64.76 pct) 199190.89 (254.53 pct)
Triad: 52530.37 (0.00 pct) 88348.62 (68.18 pct) 197430.72 (275.84 pct)


The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in mel-v6 in NPS2 mode:

stream-5261 [016] d..2. 262.189413: sched_wakeup_new: comm=stream pid=5263 prio=120 target_cpu=000
stream-5261 [016] d..2. 262.189459: sched_wakeup_new: comm=stream pid=5264 prio=120 target_cpu=016
stream-5261 [016] d..2. 262.189568: sched_wakeup_new: comm=stream pid=5265 prio=120 target_cpu=016
stream-5261 [016] d..2. 262.189621: sched_wakeup_new: comm=stream pid=5266 prio=120 target_cpu=048
stream-5261 [016] d..2. 262.189678: sched_wakeup_new: comm=stream pid=5267 prio=120 target_cpu=032
stream-5261 [016] d..2. 262.189720: sched_wakeup_new: comm=stream pid=5268 prio=120 target_cpu=016
stream-5261 [016] d..2. 262.189758: sched_wakeup_new: comm=stream pid=5269 prio=120 target_cpu=016

Four stream threads pile up on cpu 16 initially. For short runs,
where the load balancer does not have enough time to kick in and
migrate the tasks, performance may suffer. This pattern is observed
consistently: tasks pile up on one cpu of the group the runner
script is pinned to.

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks with this fix in NPS2 mode:

stream-5260 [016] d..2. 212.794629: sched_wakeup_new: comm=stream pid=5262 prio=120 target_cpu=000
stream-5260 [016] d..2. 212.794683: sched_wakeup_new: comm=stream pid=5263 prio=120 target_cpu=048
stream-5260 [016] d..2. 212.794789: sched_wakeup_new: comm=stream pid=5264 prio=120 target_cpu=032
stream-5260 [016] d..2. 212.794850: sched_wakeup_new: comm=stream pid=5265 prio=120 target_cpu=112
stream-5260 [016] d..2. 212.794903: sched_wakeup_new: comm=stream pid=5266 prio=120 target_cpu=096
stream-5260 [016] d..2. 212.794961: sched_wakeup_new: comm=stream pid=5267 prio=120 target_cpu=080
stream-5260 [016] d..2. 212.795018: sched_wakeup_new: comm=stream pid=5268 prio=120 target_cpu=064

The tasks are distributed evenly across all the allowed cpus and
no pile up is seen.


Aggressive NUMA balancing is only done when needed. We select the
minimum of the number of allowed cpus in the sched group and the
calculated sd.imb_numa_nr as our imbalance threshold, so the default
behavior of mel-v6 is only modified when the former is smaller than
the latter.

This can help in the case of embarrassingly parallel programs with a
tight cpu affinity set.


[1] https://lore.kernel.org/lkml/20220208094334.16379-1-mgorman@xxxxxxxxxxxxxxxxxxx/

Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
---
Changelog v1-->v2:
- Rebase changes on top of v6 of Mel's
"Adjust NUMA imbalance for multiple LLCs" patchset
- Reuse select_idle_mask ptr to store result of cpumask_and based
on Mel's suggestion
---
kernel/sched/fair.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c4bfffe8c2c..6e875f1f34e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9130,6 +9130,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)

case group_has_spare:
if (sd->flags & SD_NUMA) {
+ struct cpumask *cpus;
+ int imb;
#ifdef CONFIG_NUMA_BALANCING
int idlest_cpu;
/*
@@ -9147,10 +9149,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
* Otherwise, keep the task close to the wakeup source
* and improve locality if the number of running tasks
* would remain below threshold where an imbalance is
- * allowed. If there is a real need of migration,
- * periodic load balance will take care of it.
+ * allowed while accounting for the possibility the
+ * task is pinned to a subset of CPUs. If there is a
+ * real need of migration, periodic load balance will
+ * take care of it.
*/
- if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
+ cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
+ imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
+ if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
return NULL;
}

--
2.25.1