Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

From: Chen Yu
Date: Wed Jun 14 2023 - 10:59:00 EST


On 2023-06-14 at 10:17:57 +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 04:00:39PM +0530, K Prateek Nayak wrote:
>
> > >> - SIS_NODE_TOPOEXT - tip:sched/core + this patch
> > >> + new sched domain (Multi-Multi-Core or MMC)
> > >> (https://lore.kernel.org/all/20230601153522.GB559993@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/)
> > >> MMC domain groups 2 nearby CCX.
> > >
> > > OK, so you managed to get the NPS4 topology in NPS1 mode?
> >
> > Yup! But it is a hack. I'll leave the patch at the end.
>
> Chen Yu, could we do the reverse? Instead of building a bigger LLC
> domain, can we split our LLC based on SNC (sub-numa-cluster) topologies?
>
Hi Peter,
Do you mean with SNC enabled, if the LLC domain gets smaller?
According to the test, the answer seems to be yes.

SNC enabled:
grep . /sys/kernel/debug/sched/domains/cpu0/domain*/{name,flags}
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:MC
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA
/sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA
/sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
/sys/kernel/debug/sched/domains/cpu0/domain3/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA

The MC domain1 has the SD_SHARE_PKG_RESOURCES flag, while
the sub-NUMA domain2 does not have it.

cat /proc/schedstat | grep cpu0 -A 4
cpu0 0 0 0 0 0 0 737153151491 189570367069 38103461
domain0 00000000,00000000,00000000,00010000,00000000,00000000,00000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000fff,ffff0000,00000000,00000000,0fffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,000000ff,ffffffff,ffff0000,00000000,00ffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

SNC disabled:
grep . /sys/kernel/debug/sched/domains/cpu0/domain*/{name,flags}
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:MC
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA
/sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA


cat /proc/schedstat | grep cpu0 -A 4
cpu0 0 0 0 0 0 0 1030156602546 31734021553 1590389
domain0 00000000,00000000,00000000,00010000,00000000,00000000,00000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,000000ff,ffffffff,ffff0000,00000000,00ffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 747972895325 29319670483 1420637

> Because as you know, Intel chips are having the reverse problem of the
> LLC being entirely too large, so perhaps we can break it up along the
> SNC lines.
>
> Could you see if that works?
I think spliting LLC would help. Or do you mean even with SNC disabled, are
we able to detect the SNC border? Currently the numa topo is detected from ACPI
table, not sure if SNC is disabled in BIOS, can the OS detect that. I'll have
a check.

thanks,
Chenyu