Re: [RFC PATCH 1/1] sched: ttwu_queue_cond: perform queued wakeups across different L2 caches

From: Vincent Guittot
Date: Thu Aug 17 2023 - 12:02:02 EST


On Thu, 17 Aug 2023 at 17:34, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> Skipping queued wakeups for all logical CPUs sharing an LLC means that
> on a 192-core AMD EPYC 9654 96-Core Processor (across 2 sockets), groups
> of 8 cores (16 hardware threads) end up grabbing the runqueue locks of
> other CPUs within the same group on every wakeup, causing contention on
> those runqueue locks.
>
> Improve this by only considering hardware threads that share an L2 cache
> as candidates for skipping the queued wakeup.
>
> This results in the following benchmark improvements:
>
> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>
> from 49s to 34s (a 30% reduction in runtime).
>
> And similarly with perf bench:
>
> perf bench sched messaging -g 32 -p -t -l 100000
>
> from 10.9s to 7.4s (a 32% reduction in runtime).
>
> This was developed as part of the investigation into a weird regression
> reported by AMD, where adding a raw spinlock to the scheduler context
> switch accelerated hackbench. It turned out that replacing this raw
> spinlock with a loop of 10000 cpu_relax() calls within do_idle() had
> similar benefits.
>
> This patch achieves a similar effect without busy-waiting and without
> changing anything about runqueue selection on wakeup. It only lets
> hardware threads sharing an L2 cache skip the queued try-to-wake-up and
> grab the target runqueue lock directly, rather than allowing all
> hardware threads sharing an LLC to do so.
>
> I would be interested in feedback about the performance impact of this
> patch (improvement or regression) on other workloads and hardware,
> especially on Intel CPUs. One thing we might want to figure out
> empirically from the topology is whether there is a maximum number of
> hardware threads within an LLC below which it still makes sense to use
> the LLC, rather than the L2, as the group within which queued wakeups
> can be skipped.
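
A rough sketch of what such a topology-based cut-off could look like on
top of this patch (ttwu_may_skip_queue() and the threshold value are
placeholders for illustration only; sd_llc_size is the existing per-CPU
count of CPUs sharing the LLC):

/* Placeholder threshold, to be determined empirically per topology. */
#define TTWU_LLC_DIRECT_THRESHOLD	16

static inline bool ttwu_may_skip_queue(int this_cpu, int that_cpu)
{
	/* Small LLC: keep skipping queued wakeups across the whole LLC. */
	if (per_cpu(sd_llc_size, this_cpu) <= TTWU_LLC_DIRECT_THRESHOLD)
		return cpus_share_cache(this_cpu, that_cpu);
	/* Large LLC: only skip within the L2 (cluster) span. */
	return cpus_share_cluster(this_cpu, that_cpu);
}
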
>
> [ Only tested on AMD CPUs so far. ]
>
> Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@xxxxxxx
> Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Ben Segall <bsegall@xxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
> Cc: Swapnil Sapkal <Swapnil.Sapkal@xxxxxxx>
> Cc: Aaron Lu <aaron.lu@xxxxxxxxx>
> Cc: x86@xxxxxxxxxx
> ---
>  arch/Kconfig                   |  6 ++++++
>  arch/x86/Kconfig               |  1 +
>  drivers/base/Kconfig           |  1 +
>  include/linux/sched/topology.h |  3 ++-
>  kernel/sched/core.c            | 26 +++++++++++++++++++++++---
>  5 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 205fd23e0cad..e5aac1741712 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -340,6 +340,12 @@ config HAVE_ASM_MODVERSIONS
>  	  <asm/asm-prototypes.h> to support the module versioning for symbols
>  	  exported from assembly code.
>
> +config HAVE_CLUSTERGROUP
> +	bool
> +	help
> +	  This symbol should be selected by an architecture if it
> +	  implements CPU cluster groups.
> +
>  config HAVE_REGS_AND_STACK_ACCESS_API
>  	bool
>  	help
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index cb1031018afa..07813a1a9a58 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -299,6 +299,7 @@ config X86
>  	select FUNCTION_ALIGNMENT_4B
>  	imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
>  	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
> +	select HAVE_CLUSTERGROUP
>
>  config INSTRUCTION_DECODER
>  	def_bool y
> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> index 2b8fd6bb7da0..408aaf7a4bd1 100644
> --- a/drivers/base/Kconfig
> +++ b/drivers/base/Kconfig
> @@ -218,6 +218,7 @@ config DMA_FENCE_TRACE
>
>  config GENERIC_ARCH_TOPOLOGY
>  	bool
> +	select HAVE_CLUSTERGROUP
>  	help
>  	  Enable support for architectures common topology code: e.g., parsing
>  	  CPU capacity information from DT, usage of such information for
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 816df6cc444e..714386070463 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -178,7 +178,8 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
> void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
>
> -bool cpus_share_cache(int this_cpu, int that_cpu);
> +bool cpus_share_cluster(int this_cpu, int that_cpu); /* Share L2. */
> +bool cpus_share_cache(int this_cpu, int that_cpu); /* Share LLC. */

I think Yicong is already doing what you want with
cpus_share_lowest_cache(), which points to the cluster when available
and falls back to the LLC otherwise:
https://lore.kernel.org/lkml/20220720081150.22167-1-yangyicong@xxxxxxxxxxxxx/t/#m0ab9fa0fe0c3779b9bbadcfbc1b643dce7cb7618
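
With that series, the check in ttwu_queue_cond() would become something
like the following (assuming cpus_share_lowest_cache() keeps the
(this_cpu, that_cpu) signature from the posting above), without needing
a separate cpus_share_cluster():

-	if (!cpus_share_cache(smp_processor_id(), cpu))
+	if (!cpus_share_lowest_cache(smp_processor_id(), cpu))
 		return true;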

>
> typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
> typedef int (*sched_domain_flags_f)(void);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a68d1276bab0..ce3402b81e5e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3904,13 +3904,33 @@ void wake_up_if_idle(int cpu)
>  	rcu_read_unlock();
>  }
>
> +/*
> + * Query whether CPUs share LLC.
> + */
>  bool cpus_share_cache(int this_cpu, int that_cpu)
> +{
> +	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
> +}
> +
> +#ifdef CONFIG_HAVE_CLUSTERGROUP
> +/*
> + * Query whether CPUs share L2 cache.
> + */
> +bool cpus_share_cluster(int this_cpu, int that_cpu)
>  {
>  	if (this_cpu == that_cpu)
>  		return true;
> -
> -	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
> +	return cpumask_test_cpu(that_cpu, cpu_clustergroup_mask(this_cpu));
> +}
> +#else
> +/*
> + * Fall-back on querying whether CPUs share LLC.
> + */
> +bool cpus_share_cluster(int this_cpu, int that_cpu)
> +{
> +	return cpus_share_cache(this_cpu, that_cpu);
>  }
> +#endif
>
>  static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
>  {
> @@ -3929,7 +3949,7 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
>  	 * If the CPU does not share cache, then queue the task on the
>  	 * remote rqs wakelist to avoid accessing remote data.
>  	 */
> -	if (!cpus_share_cache(smp_processor_id(), cpu))
> +	if (!cpus_share_cluster(smp_processor_id(), cpu))
>  		return true;
>
>  	if (cpu == smp_processor_id())
> --
> 2.39.2
>