Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

From: K Prateek Nayak
Date: Wed Jul 12 2023 - 23:43:51 EST


Hello Chenyu,

On 7/12/2023 10:49 PM, Chen Yu wrote:
> On 2023-07-08 at 21:17:10 +0800, Chen Yu wrote:
>> On 2023-07-05 at 13:57:02 +0200, Peter Zijlstra wrote:
>>> On Fri, Jun 16, 2023 at 12:04:48PM +0530, K Prateek Nayak wrote:
>>>
>>> --- a/arch/x86/kernel/smpboot.c
>>> +++ b/arch/x86/kernel/smpboot.c
>>> @@ -596,7 +596,7 @@ static inline int x86_sched_itmt_flags(v
>>> #ifdef CONFIG_SCHED_MC
>>> static int x86_core_flags(void)
>>> {
>>> - return cpu_core_flags() | x86_sched_itmt_flags();
>>> + return cpu_core_flags() | x86_sched_itmt_flags() | SD_IDLE_SIBLING;
>>> }
>> I guess this flag might need to be added into the valid mask:
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index d3a3b2646ec4..4a563e9f7b10 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1540,6 +1540,7 @@ static struct cpumask ***sched_domains_numa_masks;
>> #define TOPOLOGY_SD_FLAGS \
>> (SD_SHARE_CPUCAPACITY | \
>> SD_SHARE_PKG_RESOURCES | \
>> + SD_IDLE_SIBLING | \
>> SD_NUMA | \
>> SD_ASYM_PACKING)
>>> #endif
>>> #ifdef CONFIG_SCHED_SMT
>>> --- a/include/linux/sched/sd_flags.h
>>> +++ b/include/linux/sched/sd_flags.h
>>> @@ -161,3 +161,10 @@ SD_FLAG(SD_OVERLAP, SDF_SHARED_PARENT |
>>> * NEEDS_GROUPS: No point in preserving domain if it has a single group.
>>> */
>>> SD_FLAG(SD_NUMA, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
>>> +
>>> +/*
>>> + * Search for idle CPUs in sibling groups
>>> + *
>>> + * NEEDS_GROUPS: Load balancing flag.
>>> + */
>>> +SD_FLAG(SD_IDLE_SIBLING, SDF_NEEDS_GROUPS)
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7046,6 +7046,38 @@ static int select_idle_cpu(struct task_s
>>> }
>>>
>>> /*
>>> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
>>> + * local LLC comes up empty.
>>> + */
>>> +static int
>>> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>>> +{
>>> + struct sched_domain *parent = sd->parent;
>>> + struct sched_group *sg;
>>> +
>>> + /* Make sure to not cross nodes. */
>>> + if (!parent || parent->flags & SD_NUMA)
>>> + return -1;
>>> +
>>> + sg = parent->groups;
>>> + do {
>>> + int cpu = cpumask_first(sched_group_span(sg));
>>> + struct sched_domain *sd_child = per_cpu(sd_llc, cpu);
>>>
>> I wonder if we can use rcu_dereference() in case the cpu hotplug
>> changes the content sd_llc points to. (I'm still thinking of the
>> symptom you described here:)
>> https://lore.kernel.org/lkml/20230605190746.GX83892@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>>
>> I'll launch some tests with this version on Sapphire Rapids(and with/without LLC-split hack patch).
>
> Tested on Sapphire Rapids, which has 2 x 56C/112T and 224 CPUs in total. C-states
> deeper than C1E are disabled. Turbo is disabled. CPU frequency governor is performance.
>
> The baseline is v6.4-rc1 tip:sched/core, on top of
> commit 637c9509f3db ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
>
> patch0: this SD_IDLE_SIBLING patch with above change to TOPOLOGY_SD_FLAGS
> patch1: hack patch to split 1 LLC domain into 4 smaller LLC domains(with some fixes on top of
> https://lore.kernel.org/lkml/ZJKjvx%2FNxooM5z1Y@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/)
> The test data in above link is invalid due to bugs in the hack patch, fixed in this version)
>
>
> Baseline vs Baseline+patch0:
> There is no much difference between the two, and it is expected because Sapphire Rapids
> does not have multiple LLC domains within 1 Numa node(also consider the run to run variation):
>
> hackbench
> =========
> case load baseline(std%) compare%( std%)
> process-pipe 1-groups 1.00 ( 2.66) +13.84 ( 12.80)
> process-pipe 2-groups 1.00 ( 3.67) -8.37 ( 2.33)
> process-pipe 4-groups 1.00 ( 6.45) +4.17 ( 6.36)
> process-pipe 8-groups 1.00 ( 1.69) +2.28 ( 1.72)
> process-sockets 1-groups 1.00 ( 1.73) +0.61 ( 0.69)
> process-sockets 2-groups 1.00 ( 2.68) -2.20 ( 0.55)
> process-sockets 4-groups 1.00 ( 0.03) -0.34 ( 0.17)
> process-sockets 8-groups 1.00 ( 0.09) -0.28 ( 0.09)
> threads-pipe 1-groups 1.00 ( 2.42) +6.95 ( 3.86)
> threads-pipe 2-groups 1.00 ( 2.26) +2.68 ( 6.56)
> threads-pipe 4-groups 1.00 ( 5.08) +3.57 ( 4.61)
> threads-pipe 8-groups 1.00 ( 7.89) -2.52 ( 3.45)
> threads-sockets 1-groups 1.00 ( 1.15) +0.87 ( 3.13)
> threads-sockets 2-groups 1.00 ( 0.63) -0.02 ( 1.27)
> threads-sockets 4-groups 1.00 ( 0.27) +0.29 ( 0.17)
> threads-sockets 8-groups 1.00 ( 0.07) -0.42 ( 0.40)
>
> netperf
> =======
> case load baseline(std%) compare%( std%)
> TCP_RR 56-threads 1.00 ( 2.56) -0.25 ( 3.27)
> TCP_RR 112-threads 1.00 ( 2.26) +0.04 ( 2.18)
> TCP_RR 168-threads 1.00 ( 0.81) +0.01 ( 0.74)
> TCP_RR 224-threads 1.00 ( 0.65) +0.04 ( 0.66)
> TCP_RR 280-threads 1.00 ( 64.56) +69.47 ( 56.78)
> TCP_RR 336-threads 1.00 ( 20.39) +0.08 ( 19.58)
> TCP_RR 392-threads 1.00 ( 31.63) +0.17 ( 31.08)
> TCP_RR 448-threads 1.00 ( 39.72) -0.14 ( 39.14)
> UDP_RR 56-threads 1.00 ( 8.94) -0.71 ( 12.03)
> UDP_RR 112-threads 1.00 ( 18.72) +0.78 ( 16.71)
> UDP_RR 168-threads 1.00 ( 11.39) -0.18 ( 8.34)
> UDP_RR 224-threads 1.00 ( 9.02) +0.81 ( 11.47)
> UDP_RR 280-threads 1.00 ( 15.87) -0.12 ( 12.87)
> UDP_RR 336-threads 1.00 ( 39.89) +2.25 ( 32.35)
> UDP_RR 392-threads 1.00 ( 28.17) +3.47 ( 25.99)
> UDP_RR 448-threads 1.00 ( 58.68) +0.35 ( 56.16)
>
> tbench
> ======
> case load baseline(std%) compare%( std%)
> loopback 56-threads 1.00 ( 0.94) +0.24 ( 0.69)
> loopback 112-threads 1.00 ( 0.19) +0.18 ( 0.25)
> loopback 168-threads 1.00 ( 52.17) -1.42 ( 50.95)
> loopback 224-threads 1.00 ( 0.86) -0.38 ( 0.19)
> loopback 280-threads 1.00 ( 0.12) -0.28 ( 0.17)
> loopback 336-threads 1.00 ( 0.10) -0.33 ( 0.19)
> loopback 392-threads 1.00 ( 0.27) -0.49 ( 0.26)
> loopback 448-threads 1.00 ( 0.06) -0.88 ( 0.59)
>
> schbench
> ========
> case load baseline(std%) compare%( std%)
> normal 1-mthreads 1.00 ( 0.72) -1.47 ( 0.41)
> normal 2-mthreads 1.00 ( 1.66) +1.18 ( 2.63)
> normal 4-mthreads 1.00 ( 1.12) +1.20 ( 4.52)
> normal 8-mthreads 1.00 ( 11.03) -3.87 ( 5.14)
>
>
> Baseline+patch1 vs Baseline+patch0+patch1:
>
> With multiple LLC domains in 1 Numa node, SD_IDLE_SIBLING brings improvement
> to hackbench/schbench, while brings downgrading to netperf/tbench. This is aligned
> with what was observed previously, if the waker and wakee wakes up each other
> frequently, they would like to be put together for cache locality. While for
> other tasks do not have shared resource, always choosing an idle CPU is better.
> Maybe in the future we can look back at SIS_SHORT and terminates scan in
> select_idle_node() if the waker and wakee have close relationship with
> each other.

Gautham and I were discussing this and realized that when calling
ttwu_queue_wakelist(), in a simulated split-LLC case, ttwu_queue_cond()
will recommend using the wakelist and send an IPI despite the
groups of the DIE domain sharing the cache in your case.

Can you check if the following change helps the regression?
(Note: Completely untested and there may be other such cases lurking
around that we've not yet considered)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..a8cab1c81aca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3929,7 +3929,7 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
* If the CPU does not share cache, then queue the task on the
* remote rqs wakelist to avoid accessing remote data.
*/
- if (!cpus_share_cache(smp_processor_id(), cpu))
+ if (cpu_to_node(smp_processor_id()) != cpu_to_node(cpu))
return true;

if (cpu == smp_processor_id())
--

>
>
> hackbench
> =========
> case load baseline(std%) compare%( std%)
> process-pipe 1-groups 1.00 ( 0.25) +31.65 ( 6.77)
> process-pipe 2-groups 1.00 ( 0.28) +29.50 ( 5.35)
> process-pipe 4-groups 1.00 ( 0.08) +16.77 ( 1.30)
> process-pipe 8-groups 1.00 ( 0.20) -5.18 ( 0.04)
> process-sockets 1-groups 1.00 ( 0.23) +13.68 ( 1.28)
> process-sockets 2-groups 1.00 ( 0.16) +11.18 ( 1.87)
> process-sockets 4-groups 1.00 ( 0.23) -0.06 ( 0.21)
> process-sockets 8-groups 1.00 ( 0.36) +2.34 ( 0.15)
> threads-pipe 1-groups 1.00 ( 5.23) +16.38 ( 12.10)
> threads-pipe 2-groups 1.00 ( 1.63) +28.52 ( 5.17)
> threads-pipe 4-groups 1.00 ( 0.77) +23.28 ( 2.42)
> threads-pipe 8-groups 1.00 ( 2.27) +2.35 ( 5.75)
> threads-sockets 1-groups 1.00 ( 2.31) +0.42 ( 1.68)
> threads-sockets 2-groups 1.00 ( 0.56) +3.98 ( 0.65)
> threads-sockets 4-groups 1.00 ( 0.12) +0.29 ( 0.32)
> threads-sockets 8-groups 1.00 ( 0.86) +1.92 ( 0.27)
>
> netperf
> =======
> case load baseline(std%) compare%( std%)
> TCP_RR 56-threads 1.00 ( 12.46) -1.62 ( 12.14)
> TCP_RR 112-threads 1.00 ( 1.34) -0.16 ( 1.42)
> TCP_RR 168-threads 1.00 ( 6.26) -0.88 ( 6.08)
> TCP_RR 224-threads 1.00 ( 2.19) -90.18 ( 6.12)
> TCP_RR 280-threads 1.00 ( 12.27) -63.81 ( 74.25)
> TCP_RR 336-threads 1.00 ( 29.28) -6.21 ( 18.48)
> TCP_RR 392-threads 1.00 ( 39.39) -3.87 ( 26.63)
> TCP_RR 448-threads 1.00 ( 47.45) -2.34 ( 32.37)
> UDP_RR 56-threads 1.00 ( 3.28) -0.31 ( 2.81)
> UDP_RR 112-threads 1.00 ( 7.03) +0.55 ( 7.03)
> UDP_RR 168-threads 1.00 ( 17.42) -0.51 ( 15.63)
> UDP_RR 224-threads 1.00 ( 20.79) -68.28 ( 14.32)
> UDP_RR 280-threads 1.00 ( 26.23) -68.58 ( 18.60)
> UDP_RR 336-threads 1.00 ( 38.99) -0.55 ( 21.19)
> UDP_RR 392-threads 1.00 ( 44.22) -1.91 ( 27.44)
> UDP_RR 448-threads 1.00 ( 55.11) -2.74 ( 38.55)
>
> tbench
> ======
> case load baseline(std%) compare%( std%)
> loopback 56-threads 1.00 ( 2.69) -2.30 ( 2.69)
> loopback 112-threads 1.00 ( 1.92) +0.62 ( 1.46)
> loopback 168-threads 1.00 ( 0.97) -67.69 ( 0.06)
> loopback 224-threads 1.00 ( 0.24) -6.79 ( 8.81)
> loopback 280-threads 1.00 ( 0.10) +0.47 ( 0.62)
> loopback 336-threads 1.00 ( 0.85) -0.05 ( 0.05)
> loopback 392-threads 1.00 ( 0.62) +0.77 ( 0.50)
> loopback 448-threads 1.00 ( 0.36) +0.77 ( 0.77)
>
> schbench
> ========
> case load baseline(std%) compare%( std%)
> normal 1-mthreads 1.00 ( 0.82) +1.44 ( 1.24)
> normal 2-mthreads 1.00 ( 2.13) +1.16 ( 0.41)
> normal 4-mthreads 1.00 ( 3.82) -0.30 ( 1.48)
> normal 8-mthreads 1.00 ( 4.80) +22.43 ( 13.03)
>
> But since the multiple LLC is just a simulation on Intel platform for now,
> the patch is ok and:
>
> Tested-by: Chen Yu <yu.c.chen@xxxxxxxxx>
>
> thanks,
> Chenyu

--
Thanks and Regards,
Prateek