[Patch v3 0/6] Enable Cluster Scheduling for x86 Hybrid CPUs

From: Tim Chen
Date: Fri Jul 07 2023 - 18:57:05 EST


This is the third version of the patch series that fixes the issues
preventing cluster scheduling on x86 hybrid CPUs. It addresses the
concerns Peter raised on the second version. Please refer to the cover
letter of the first version for the motivation behind this series.

Changes from v2:
1. Peter pointed out that biasing asym packing in the sibling imbalance
computation is unnecessary: concentrating tasks in the preferred group
negates any extra turbo headroom advantage. In v3, the sibling imbalance
is computed only in proportion to the number of cores, and the asym
packing bias is removed. We do not lose any performance and do a bit
better than v2.

2. Peter asked whether it is better to round the sibling_imbalance()
computation or to floor it, as the v2 implementation did. I found
rounding to be better in threaded tensor computation, hence v3 adopts
rounding in sibling_imbalance(). The performance of both variants is
listed in the performance data below.

3. Fix patch 1 to take cores with more than 2 SMT threads into consideration.

4. Various style cleanups suggested by Peter.
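
To make the difference between flooring and rounding in item 2 concrete,
here is a rough userspace sketch. It is not the code from the patches:
the function names, the argument order, and the normalization by the
total core count of the two groups are assumptions made for
illustration only. It just shows an imbalance that is proportional to
the number of cores, divided once with truncation and once with
round-to-nearest.

```c
#include <assert.h>

/*
 * Hypothetical helper, not taken from the patches: how far the busiest
 * group is above a tasks-per-core ratio equal to the local group's.
 * Cross-multiplying avoids fractions; equal ratios give 0.
 */
long imbalance_numerator(int ncores_busiest, long nr_busiest,
			 int ncores_local, long nr_local)
{
	long num = (long)ncores_local * nr_busiest -
		   (long)ncores_busiest * nr_local;

	return num > 0 ? num : 0;	/* never report a negative imbalance */
}

/* Floor variant: plain integer division truncates (num is >= 0). */
long sibling_imbalance_floor(int ncores_busiest, long nr_busiest,
			     int ncores_local, long nr_local)
{
	long den = (long)ncores_busiest + ncores_local;

	assert(den > 0);
	return imbalance_numerator(ncores_busiest, nr_busiest,
				   ncores_local, nr_local) / den;
}

/* Rounding variant: add half the divisor before dividing. */
long sibling_imbalance_round(int ncores_busiest, long nr_busiest,
			     int ncores_local, long nr_local)
{
	long den = (long)ncores_busiest + ncores_local;

	assert(den > 0);
	return (imbalance_numerator(ncores_busiest, nr_busiest,
				    ncores_local, nr_local) + den / 2) / den;
}
```

For a made-up busiest group with 4 cores and 4 running tasks next to a
local group with 2 cores and 1 running task, the numerator is
2*4 - 4*1 = 4 and the divisor is 6, so the floor variant reports 0 while
the rounding variant reports 1.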

Past Versions:
[v1] https://lore.kernel.org/lkml/CAKfTPtD1W6vJQBsNKEt_4tn2EeAs=73CeH4LoCwENrh2JUDwnQ@xxxxxxxxxxxxxx/T/
[v2] https://lore.kernel.org/all/cover.1686263351.git.tim.c.chen@xxxxxxxxxxxxxxx/

v3 Performance numbers:

                                       This version
Single Threaded   6.3-rc5              with cluster         Improvement      Alternative          Improvement
Benchmark         Baseline             scheduling           in Performance   implementation       in Performance
                                       (round imbalance)                     (floor imbalance)
                  (run-run deviation)  (run-run deviation)                   (run-run deviation)
----------------------------------------------------------------------------------------------------------------
tjbench           (+/- 0.08%)          (+/- 0.12%)           0.03%           (+/- 0.11%)           0.00%
PhPbench          (+/- 0.31%)          (+/- 0.50%)          +0.19%           (+/- 0.87%)          +0.21%
flac              (+/- 0.58%)          (+/- 0.41%)          +0.48%           (+/- 0.41%)          +1.02%
pybench           (+/- 3.16%)          (+/- 2.87%)          +2.04%           (+/- 2.22%)          +4.25%


                                       This version
                                       with cluster         Improvement      Alternative          Improvement
Multi Threaded    6.3-rc5              scheduling           in Performance   implementation       in Performance
Benchmark         Baseline             (round imbalance)                     (floor imbalance)
(-#threads)       (run-run deviation)  (run-run deviation)                   (run-run deviation)
----------------------------------------------------------------------------------------------------------------
Kbuild-8          (+/- 2.90%)          (+/- 0.23%)          -1.10%           (+/- 0.40%)          -1.01%
Kbuild-10         (+/- 3.08%)          (+/- 0.51%)          -1.93%           (+/- 0.49%)          -1.57%
Kbuild-12         (+/- 3.28%)          (+/- 0.39%)          -1.10%           (+/- 0.23%)          -0.98%
Tensor Lite-8     (+/- 4.84%)          (+/- 0.86%)          -1.32%           (+/- 0.58%)          -0.78%
Tensor Lite-10    (+/- 0.87%)          (+/- 0.30%)          +0.68%           (+/- 1.24%)          -0.13%
Tensor Lite-12    (+/- 1.37%)          (+/- 0.82%)          +4.16%           (+/- 1.65%)          +1.19%

Tim


Peter Zijlstra (Intel) (1):
  sched/debug: Dump domains' sched group flags

Ricardo Neri (1):
  sched/fair: Consider the idle state of the whole core for load balance

Tim C Chen (4):
  sched/fair: Determine active load balance for SMT sched groups
  sched/topology: Record number of cores in sched group
  sched/fair: Implement prefer sibling imbalance calculation between
    asymmetric groups
  sched/x86: Add cluster topology to hybrid CPU

 arch/x86/kernel/smpboot.c |   3 +
 kernel/sched/debug.c      |   1 +
 kernel/sched/fair.c       | 137 +++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h      |   1 +
 kernel/sched/topology.c   |  10 ++-
 5 files changed, 143 insertions(+), 9 deletions(-)

--
2.32.0