Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

From: Chen Yu
Date: Wed Jun 21 2023 - 03:16:59 EST


On 2023-06-14 at 17:13:48 +0200, Peter Zijlstra wrote:
> On Wed, Jun 14, 2023 at 10:58:20PM +0800, Chen Yu wrote:
> > On 2023-06-14 at 10:17:57 +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 13, 2023 at 04:00:39PM +0530, K Prateek Nayak wrote:
> > >
> > > > >> - SIS_NODE_TOPOEXT - tip:sched/core + this patch
> > > > >> + new sched domain (Multi-Multi-Core or MMC)
> > > > >> (https://lore.kernel.org/all/20230601153522.GB559993@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/)
> > > > >> MMC domain groups 2 nearby CCX.
> > > > >
> > > > > OK, so you managed to get the NPS4 topology in NPS1 mode?
> > > >
> > > > Yup! But it is a hack. I'll leave the patch at the end.
> > >
> > > Chen Yu, could we do the reverse? Instead of building a bigger LLC
> > > domain, can we split our LLC based on SNC (sub-numa-cluster) topologies?
> > >
> > Hi Peter,
> > Do you mean with SNC enabled, if the LLC domain gets smaller?
> > According to the test, the answer seems to be yes.
>
> No, I mean to build smaller LLC domains even with SNC disabled, as-if
> SNC were active.
>
>
Per the lstopo result, the topology on Sapphire Rapids has 4 memory controllers within
1 package, and with SNC disabled the LLCs can have slightly different distances
to the 4 memory controllers. Unfortunately there is no interface for the OS
to query this partitioning. So I used a hack to split the LLC into 4 smaller ones
with SNC disabled, following the topology in SNC4 mode. Then I tested this
platform with/without the LLC split, in both cases with SIS_NODE enabled and with
this issue fixed[1], using something like this when iterating the groups in select_idle_node():

        if (cpumask_test_cpu(target, sched_group_span(sg)))
                continue;
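
For completeness, here is a rough sketch of how that check fits into
select_idle_node(). The surrounding structure is simplified from the SIS_NODE
patch, and the quoted check is inverted here to keep the circular group-list
walk simple, so treat it as illustrative rather than the exact code:

static int select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
{
        struct sched_domain *parent = sd->parent;
        struct sched_group *sg;

        /* Nothing to do when the LLC's parent is already the NUMA domain. */
        if (!parent || parent->flags & SD_NUMA)
                return -1;

        sg = parent->groups;
        do {
                /* Skip the target's own group; its LLC was already scanned. */
                if (!cpumask_test_cpu(target, sched_group_span(sg))) {
                        int cpu = cpumask_first(sched_group_span(sg));
                        int i = select_idle_cpu(p, per_cpu(sd_llc, cpu),
                                                test_idle_cores(cpu), cpu);

                        if ((unsigned int)i < nr_cpumask_bits)
                                return i;
                }

                sg = sg->next;
        } while (sg != parent->groups);

        return -1;
}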

SIS_NODE should have no impact on the non-LLC-split version on Sapphire Rapids
(without the split, the LLC domain's parent is already the NUMA domain), so the
baseline is vanilla+SIS_NODE.

In summary, netperf shows a huge improvement, but hackbench and schbench show
regressions at some load levels. I'll collect some schedstats to check the scan
depth in the problematic cases.


With SNC disabled and the hack llc-split patch applied, a new DIE domain is
generated and the LLC is divided into 4 sub-LLC groups (per the domain1 cpumask
below, each sub-LLC spans 14 cores plus their SMT siblings):

grep . domain*/{name,flags}
domain0/name:SMT
domain1/name:MC
domain2/name:DIE
domain3/name:NUMA
domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
domain3/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA

cat /proc/schedstat | grep cpu0 -A 4
cpu0 0 0 0 0 0 0 15968391465 3630455022 18084
domain0 00000000,00000000,00000000,00010000,00000000,00000000,00000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,3fff0000,00000000,00000000,00003fff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,000000ff,ffffffff,ffff0000,00000000,00ffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            1-groups        1.00 (  3.81)   -100.18 (  0.19)
process-pipe            2-groups        1.00 ( 10.74)    -59.21 (  0.91)
process-pipe            4-groups        1.00 (  5.37)    -56.37 (  0.56)
process-pipe            8-groups        1.00 (  0.36)    +17.11 (  0.82)
process-sockets         1-groups        1.00 (  0.09)    -26.53 (  1.45)
process-sockets         2-groups        1.00 (  0.82)    -26.45 (  0.40)
process-sockets         4-groups        1.00 (  0.21)     -4.09 (  0.19)
process-sockets         8-groups        1.00 (  0.13)     -5.31 (  0.36)
threads-pipe            1-groups        1.00 (  2.14)    -62.87 (  1.11)
threads-pipe            2-groups        1.00 (  3.18)    -55.82 (  1.14)
threads-pipe            4-groups        1.00 (  4.68)    -54.92 (  0.34)
threads-pipe            8-groups        1.00 (  5.08)    +15.81 (  3.08)
threads-sockets         1-groups        1.00 (  2.60)    -18.28 (  6.03)
threads-sockets         2-groups        1.00 (  0.83)    -30.17 (  0.60)
threads-sockets         4-groups        1.00 (  0.16)     -4.15 (  0.27)
threads-sockets         8-groups        1.00 (  0.36)     -5.92 (  0.94)

The 1-group, 2-group and 4-group cases suffered.

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  56-threads      1.00 (  2.75)    +10.49 ( 10.88)
TCP_RR                  112-threads     1.00 (  2.39)     -1.88 (  2.82)
TCP_RR                  168-threads     1.00 (  2.05)     +8.31 (  9.73)
TCP_RR                  224-threads     1.00 (  2.32)   +788.25 (  1.94)
TCP_RR                  280-threads     1.00 ( 59.77)    +83.07 ( 12.38)
TCP_RR                  336-threads     1.00 ( 21.61)     -0.22 ( 28.72)
TCP_RR                  392-threads     1.00 ( 31.26)     -0.13 ( 36.11)
TCP_RR                  448-threads     1.00 ( 39.93)     -0.14 ( 45.71)
UDP_RR                  56-threads      1.00 (  5.57)     +2.38 (  7.41)
UDP_RR                  112-threads     1.00 ( 24.53)     +1.51 (  8.43)
UDP_RR                  168-threads     1.00 ( 11.83)     +7.34 ( 20.20)
UDP_RR                  224-threads     1.00 ( 10.55)   +163.81 ( 20.64)
UDP_RR                  280-threads     1.00 ( 11.32)   +176.04 ( 21.83)
UDP_RR                  336-threads     1.00 ( 31.79)    +12.87 ( 37.23)
UDP_RR                  392-threads     1.00 ( 34.06)    +15.64 ( 44.62)
UDP_RR                  448-threads     1.00 ( 59.09)    +14.00 ( 52.93)

The 224-thread and 280-thread cases show good improvement.

tbench
======
case                    load            baseline(std%)  compare%( std%)
loopback                56-threads      1.00 (  0.83)     +1.38 (  1.56)
loopback                112-threads     1.00 (  0.19)     -4.25 (  0.90)
loopback                168-threads     1.00 ( 56.43)    -31.12 (  0.37)
loopback                224-threads     1.00 (  0.28)     -2.50 (  0.44)
loopback                280-threads     1.00 (  0.10)     -1.64 (  0.81)
loopback                336-threads     1.00 (  0.19)     -2.10 (  0.10)
loopback                392-threads     1.00 (  0.13)     -2.15 (  0.39)
loopback                448-threads     1.00 (  0.45)     -2.14 (  0.43)

There might be no impact on tbench (the 168-thread result is unstable and
can be ignored).

schbench
========
case                    load            baseline(std%)  compare%( std%)
normal                  1-mthreads      1.00 (  0.42)     -0.59 (  0.72)
normal                  2-mthreads      1.00 (  2.72)     +1.76 (  0.42)
normal                  4-mthreads      1.00 (  0.75)     -1.22 (  1.86)
normal                  8-mthreads      1.00 (  6.44)    -14.56 (  5.64)

The 8-mthreads case regresses for schbench.
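
For reference, the hack patch is below. It adds a sub_llc_nr= early boot
parameter: CPUs whose APIC IDs fall into the same sub_llc_nr-sized window are
treated as sharing an LLC, and the existing SNC check in match_llc() is reused
to split the package-wide LLC along those windows. Purely as a usage
illustration (the exact value depends on how the APIC IDs are enumerated on
the platform, so take it as an assumption), booting with something like
sub_llc_nr=28 would put every 28 consecutive APIC IDs, i.e. 14 cores plus
their SMT siblings here, into one sub-LLC.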


diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 352f0ce1ece4..ffc44639447e 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -511,6 +511,30 @@ static const struct x86_cpu_id intel_cod_cpu[] = {
 	{}
 };
 
+static unsigned int sub_llc_nr;
+
+static int __init parse_sub_llc(char *str)
+{
+	get_option(&str, &sub_llc_nr);
+
+	return 0;
+}
+early_param("sub_llc_nr", parse_sub_llc);
+
+static bool
+topology_same_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int idx1, idx2;
+
+	if (!sub_llc_nr)
+		return true;
+
+	idx1 = c->apicid / sub_llc_nr;
+	idx2 = o->apicid / sub_llc_nr;
+
+	return idx1 == idx2;
+}
+
 static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);
@@ -530,7 +554,7 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	 * means 'c' does not share the LLC of 'o'. This will be
 	 * reflected to userspace.
 	 */
-	if (match_pkg(c, o) && !topology_same_node(c, o) && intel_snc)
+	if (match_pkg(c, o) && (!topology_same_node(c, o) || !topology_same_llc(c, o)) && intel_snc)
 		return false;
 
 	return topology_sane(c, o, "llc");
--
2.25.1



[1] https://lore.kernel.org/lkml/5903fc0a-787e-9471-0256-77ff66f0bdef@xxxxxxxxxxxxx/