Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

From: Chen Yu
Date: Wed Jul 12 2023 - 13:21:15 EST


On 2023-07-08 at 21:17:10 +0800, Chen Yu wrote:
> On 2023-07-05 at 13:57:02 +0200, Peter Zijlstra wrote:
> > On Fri, Jun 16, 2023 at 12:04:48PM +0530, K Prateek Nayak wrote:
> >
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -596,7 +596,7 @@ static inline int x86_sched_itmt_flags(v
> > #ifdef CONFIG_SCHED_MC
> > static int x86_core_flags(void)
> > {
> > - return cpu_core_flags() | x86_sched_itmt_flags();
> > + return cpu_core_flags() | x86_sched_itmt_flags() | SD_IDLE_SIBLING;
> > }
> I guess this flag might need to be added into the valid mask:
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d3a3b2646ec4..4a563e9f7b10 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1540,6 +1540,7 @@ static struct cpumask ***sched_domains_numa_masks;
> #define TOPOLOGY_SD_FLAGS \
> (SD_SHARE_CPUCAPACITY | \
> SD_SHARE_PKG_RESOURCES | \
> + SD_IDLE_SIBLING | \
> SD_NUMA | \
> SD_ASYM_PACKING)
> > #endif
> > #ifdef CONFIG_SCHED_SMT
> > --- a/include/linux/sched/sd_flags.h
> > +++ b/include/linux/sched/sd_flags.h
> > @@ -161,3 +161,10 @@ SD_FLAG(SD_OVERLAP, SDF_SHARED_PARENT |
> > * NEEDS_GROUPS: No point in preserving domain if it has a single group.
> > */
> > SD_FLAG(SD_NUMA, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
> > +
> > +/*
> > + * Search for idle CPUs in sibling groups
> > + *
> > + * NEEDS_GROUPS: Load balancing flag.
> > + */
> > +SD_FLAG(SD_IDLE_SIBLING, SDF_NEEDS_GROUPS)
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7046,6 +7046,38 @@ static int select_idle_cpu(struct task_s
> > }
> >
> > /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > + struct sched_domain *parent = sd->parent;
> > + struct sched_group *sg;
> > +
> > + /* Make sure to not cross nodes. */
> > + if (!parent || parent->flags & SD_NUMA)
> > + return -1;
> > +
> > + sg = parent->groups;
> > + do {
> > + int cpu = cpumask_first(sched_group_span(sg));
> > + struct sched_domain *sd_child = per_cpu(sd_llc, cpu);
> >
> I wonder if we can use rcu_dereference() in case the cpu hotplug
> changes the content sd_llc points to. (I'm still thinking of the
> symptom you described here:)
> https://lore.kernel.org/lkml/20230605190746.GX83892@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> I'll launch some tests with this version on Sapphire Rapids(and with/without LLC-split hack patch).

Tested on Sapphire Rapids, which has 2 x 56C/112T and 224 CPUs in total. C-states
deeper than C1E are disabled. Turbo is disabled. CPU frequency governor is performance.

The baseline is v6.4-rc1 tip:sched/core, on top of
commit 637c9509f3db ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")

patch0: this SD_IDLE_SIBLING patch with above change to TOPOLOGY_SD_FLAGS
patch1: hack patch to split 1 LLC domain into 4 smaller LLC domains(with some fixes on top of
https://lore.kernel.org/lkml/ZJKjvx%2FNxooM5z1Y@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/)
The test data in above link is invalid due to bugs in the hack patch, fixed in this version)


Baseline vs Baseline+patch0:
There is no much difference between the two, and it is expected because Sapphire Rapids
does not have multiple LLC domains within 1 Numa node(also consider the run to run variation):

hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 2.66) +13.84 ( 12.80)
process-pipe 2-groups 1.00 ( 3.67) -8.37 ( 2.33)
process-pipe 4-groups 1.00 ( 6.45) +4.17 ( 6.36)
process-pipe 8-groups 1.00 ( 1.69) +2.28 ( 1.72)
process-sockets 1-groups 1.00 ( 1.73) +0.61 ( 0.69)
process-sockets 2-groups 1.00 ( 2.68) -2.20 ( 0.55)
process-sockets 4-groups 1.00 ( 0.03) -0.34 ( 0.17)
process-sockets 8-groups 1.00 ( 0.09) -0.28 ( 0.09)
threads-pipe 1-groups 1.00 ( 2.42) +6.95 ( 3.86)
threads-pipe 2-groups 1.00 ( 2.26) +2.68 ( 6.56)
threads-pipe 4-groups 1.00 ( 5.08) +3.57 ( 4.61)
threads-pipe 8-groups 1.00 ( 7.89) -2.52 ( 3.45)
threads-sockets 1-groups 1.00 ( 1.15) +0.87 ( 3.13)
threads-sockets 2-groups 1.00 ( 0.63) -0.02 ( 1.27)
threads-sockets 4-groups 1.00 ( 0.27) +0.29 ( 0.17)
threads-sockets 8-groups 1.00 ( 0.07) -0.42 ( 0.40)

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 2.56) -0.25 ( 3.27)
TCP_RR 112-threads 1.00 ( 2.26) +0.04 ( 2.18)
TCP_RR 168-threads 1.00 ( 0.81) +0.01 ( 0.74)
TCP_RR 224-threads 1.00 ( 0.65) +0.04 ( 0.66)
TCP_RR 280-threads 1.00 ( 64.56) +69.47 ( 56.78)
TCP_RR 336-threads 1.00 ( 20.39) +0.08 ( 19.58)
TCP_RR 392-threads 1.00 ( 31.63) +0.17 ( 31.08)
TCP_RR 448-threads 1.00 ( 39.72) -0.14 ( 39.14)
UDP_RR 56-threads 1.00 ( 8.94) -0.71 ( 12.03)
UDP_RR 112-threads 1.00 ( 18.72) +0.78 ( 16.71)
UDP_RR 168-threads 1.00 ( 11.39) -0.18 ( 8.34)
UDP_RR 224-threads 1.00 ( 9.02) +0.81 ( 11.47)
UDP_RR 280-threads 1.00 ( 15.87) -0.12 ( 12.87)
UDP_RR 336-threads 1.00 ( 39.89) +2.25 ( 32.35)
UDP_RR 392-threads 1.00 ( 28.17) +3.47 ( 25.99)
UDP_RR 448-threads 1.00 ( 58.68) +0.35 ( 56.16)

tbench
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 0.94) +0.24 ( 0.69)
loopback 112-threads 1.00 ( 0.19) +0.18 ( 0.25)
loopback 168-threads 1.00 ( 52.17) -1.42 ( 50.95)
loopback 224-threads 1.00 ( 0.86) -0.38 ( 0.19)
loopback 280-threads 1.00 ( 0.12) -0.28 ( 0.17)
loopback 336-threads 1.00 ( 0.10) -0.33 ( 0.19)
loopback 392-threads 1.00 ( 0.27) -0.49 ( 0.26)
loopback 448-threads 1.00 ( 0.06) -0.88 ( 0.59)

schbench
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.72) -1.47 ( 0.41)
normal 2-mthreads 1.00 ( 1.66) +1.18 ( 2.63)
normal 4-mthreads 1.00 ( 1.12) +1.20 ( 4.52)
normal 8-mthreads 1.00 ( 11.03) -3.87 ( 5.14)


Baseline+patch1 vs Baseline+patch0+patch1:

With multiple LLC domains in 1 Numa node, SD_IDLE_SIBLING brings improvement
to hackbench/schbench, while brings downgrading to netperf/tbench. This is aligned
with what was observed previously, if the waker and wakee wakes up each other
frequently, they would like to be put together for cache locality. While for
other tasks do not have shared resource, always choosing an idle CPU is better.
Maybe in the future we can look back at SIS_SHORT and terminates scan in
select_idle_node() if the waker and wakee have close relationship with
each other.


hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.25) +31.65 ( 6.77)
process-pipe 2-groups 1.00 ( 0.28) +29.50 ( 5.35)
process-pipe 4-groups 1.00 ( 0.08) +16.77 ( 1.30)
process-pipe 8-groups 1.00 ( 0.20) -5.18 ( 0.04)
process-sockets 1-groups 1.00 ( 0.23) +13.68 ( 1.28)
process-sockets 2-groups 1.00 ( 0.16) +11.18 ( 1.87)
process-sockets 4-groups 1.00 ( 0.23) -0.06 ( 0.21)
process-sockets 8-groups 1.00 ( 0.36) +2.34 ( 0.15)
threads-pipe 1-groups 1.00 ( 5.23) +16.38 ( 12.10)
threads-pipe 2-groups 1.00 ( 1.63) +28.52 ( 5.17)
threads-pipe 4-groups 1.00 ( 0.77) +23.28 ( 2.42)
threads-pipe 8-groups 1.00 ( 2.27) +2.35 ( 5.75)
threads-sockets 1-groups 1.00 ( 2.31) +0.42 ( 1.68)
threads-sockets 2-groups 1.00 ( 0.56) +3.98 ( 0.65)
threads-sockets 4-groups 1.00 ( 0.12) +0.29 ( 0.32)
threads-sockets 8-groups 1.00 ( 0.86) +1.92 ( 0.27)

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 12.46) -1.62 ( 12.14)
TCP_RR 112-threads 1.00 ( 1.34) -0.16 ( 1.42)
TCP_RR 168-threads 1.00 ( 6.26) -0.88 ( 6.08)
TCP_RR 224-threads 1.00 ( 2.19) -90.18 ( 6.12)
TCP_RR 280-threads 1.00 ( 12.27) -63.81 ( 74.25)
TCP_RR 336-threads 1.00 ( 29.28) -6.21 ( 18.48)
TCP_RR 392-threads 1.00 ( 39.39) -3.87 ( 26.63)
TCP_RR 448-threads 1.00 ( 47.45) -2.34 ( 32.37)
UDP_RR 56-threads 1.00 ( 3.28) -0.31 ( 2.81)
UDP_RR 112-threads 1.00 ( 7.03) +0.55 ( 7.03)
UDP_RR 168-threads 1.00 ( 17.42) -0.51 ( 15.63)
UDP_RR 224-threads 1.00 ( 20.79) -68.28 ( 14.32)
UDP_RR 280-threads 1.00 ( 26.23) -68.58 ( 18.60)
UDP_RR 336-threads 1.00 ( 38.99) -0.55 ( 21.19)
UDP_RR 392-threads 1.00 ( 44.22) -1.91 ( 27.44)
UDP_RR 448-threads 1.00 ( 55.11) -2.74 ( 38.55)

tbench
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 2.69) -2.30 ( 2.69)
loopback 112-threads 1.00 ( 1.92) +0.62 ( 1.46)
loopback 168-threads 1.00 ( 0.97) -67.69 ( 0.06)
loopback 224-threads 1.00 ( 0.24) -6.79 ( 8.81)
loopback 280-threads 1.00 ( 0.10) +0.47 ( 0.62)
loopback 336-threads 1.00 ( 0.85) -0.05 ( 0.05)
loopback 392-threads 1.00 ( 0.62) +0.77 ( 0.50)
loopback 448-threads 1.00 ( 0.36) +0.77 ( 0.77)

schbench
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.82) +1.44 ( 1.24)
normal 2-mthreads 1.00 ( 2.13) +1.16 ( 0.41)
normal 4-mthreads 1.00 ( 3.82) -0.30 ( 1.48)
normal 8-mthreads 1.00 ( 4.80) +22.43 ( 13.03)

But since the multiple LLC is just a simulation on Intel platform for now,
the patch is ok and:

Tested-by: Chen Yu <yu.c.chen@xxxxxxxxx>

thanks,
Chenyu