Re: [RFC PATCH 1/1] sched: ttwu_queue_cond: perform queued wakeups across different L2 caches

From: Mathieu Desnoyers
Date: Fri Aug 18 2023 - 10:35:18 EST


On 8/18/23 02:39, K Prateek Nayak wrote:
Hello Vincent, Mathieu,

On 8/17/2023 9:31 PM, Vincent Guittot wrote:
On Thu, 17 Aug 2023 at 17:34, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:

Skipping queued wakeups for all logical CPUs sharing an LLC means that
on a 192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), groups
of 8 cores (16 hardware threads) end up grabbing runqueue locks of other
runqueues within the same group for each wakeup, causing contention on
the runqueue locks.

Improve this by only considering hardware threads sharing an L2 cache as
candidates for skipping use of the queued wakeups.

This results in the following benchmark improvements:

hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 49s to 34s. (30% speedup)

And similarly with perf bench:

perf bench sched messaging -g 32 -p -t -l 100000

from 10.9s to 7.4s (32% speedup)

This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench. It turned out that changing this raw
spinlock for a loop of 10000x cpu_relax within do_idle() had similar
benefits.

This patch achieves a similar effect without busy waiting nor changing
anything about runqueue selection on wakeup. It considers that only
hardware threads sharing an L2 cache should skip the queued
try-to-wakeup and directly grab the target runqueue lock, rather than
allowing all hardware threads sharing an LLC to do so.

I would be interested to hear feedback about performance impact of this
patch (improvement or regression) on other workloads and hardware,
especially for Intel CPUs. One thing that we might want to empirically
figure out from the topology is whether there is a maximum number of
hardware threads within an LLC below which it would make sense to use
the LLC rather than L2 as group within which queued wakeups can be
skipped.

[ Only tested on AMD CPUs so far. ]

Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@xxxxxxx
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@xxxxxxxxxxxx/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@xxxxxxxxxxxx/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Swapnil Sapkal <Swapnil.Sapkal@xxxxxxx>
Cc: Aaron Lu <aaron.lu@xxxxxxxxx>
Cc: x86@xxxxxxxxxx
---
arch/Kconfig | 6 ++++++
arch/x86/Kconfig | 1 +
drivers/base/Kconfig | 1 +
include/linux/sched/topology.h | 3 ++-
kernel/sched/core.c | 26 +++++++++++++++++++++++---
5 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cad..e5aac1741712 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -340,6 +340,12 @@ config HAVE_ASM_MODVERSIONS
<asm/asm-prototypes.h> to support the module versioning for symbols
exported from assembly code.

+config HAVE_CLUSTERGROUP
+ bool
+ help
+ This symbol should be selected by an architecture if it
+ implements CPU clustergroup.
+
config HAVE_REGS_AND_STACK_ACCESS_API
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cb1031018afa..07813a1a9a58 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -299,6 +299,7 @@ config X86
select FUNCTION_ALIGNMENT_4B
imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+ select HAVE_CLUSTERGROUP

config INSTRUCTION_DECODER
def_bool y
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 2b8fd6bb7da0..408aaf7a4bd1 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -218,6 +218,7 @@ config DMA_FENCE_TRACE

config GENERIC_ARCH_TOPOLOGY
bool
+ select HAVE_CLUSTERGROUP
help
Enable support for architectures common topology code: e.g., parsing
CPU capacity information from DT, usage of such information for
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..714386070463 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -178,7 +178,8 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);

-bool cpus_share_cache(int this_cpu, int that_cpu);
+bool cpus_share_cluster(int this_cpu, int that_cpu); /* Share L2. */
+bool cpus_share_cache(int this_cpu, int that_cpu); /* Share LLC. */

I think that Yicong is doing what you want with
cpus_share_lowest_cache() which points to cluster when available or
LLC otherwise
https://lore.kernel.org/lkml/20220720081150.22167-1-yangyicong@xxxxxxxxxxxxx/t/#m0ab9fa0fe0c3779b9bbadcfbc1b643dce7cb7618
Please correct me if I'm wrong, but with Yicong's latest version
(https://lore.kernel.org/lkml/20230719092838.2302-2-yangyicong@xxxxxxxxxx/)
"sd_share_id" will follow "sd_llc_id" if the SD_CLUSTER domain is
degenerated, which is the case with any system where CLUSTER domain is
same as SMT domain.

On logging the cpu and sd_share_id on my 3rd Gen EPYC system I see,

CPU(0) sd_share_id(0)
CPU(1) sd_share_id(0)
CPU(2) sd_share_id(0)
CPU(3) sd_share_id(0)
CPU(4) sd_share_id(0)
CPU(5) sd_share_id(0)
CPU(6) sd_share_id(0)
CPU(7) sd_share_id(0)
CPU(8) sd_share_id(8)
CPU(9) sd_share_id(8)
...
CPU(127) sd_share_id(120)
CPU(128) sd_share_id(0)
CPU(129) sd_share_id(0)
CPU(130) sd_share_id(0)
CPU(131) sd_share_id(0)
CPU(132) sd_share_id(0)
CPU(133) sd_share_id(0)
CPU(134) sd_share_id(0)
CPU(135) sd_share_id(0)
CPU(136) sd_share_id(8)
...

"sd_share_id" follows the "sd_llc_id" since CLUSTER domain is
degenerated.

# echo Y > /sys/kernel/debug/sched/verbose
# cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING # SMT Domain
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING # MC Domain
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING # DIE Domain
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA # NUMA Domain

But I believe Mathieu's case would require falling back to "core_id" if
the cluster domain has degenerated.

Yes, I've noticed the same thing on my system. I'm preparing an updated patchset which renames cpus_share_cache() to cpus_share_llc(), and introduces a new cpus_share_l2c() based on a new sd_l2c_id per-cpu variable derived from topology_cluster_cpumask(). It seems to work fine so far.

Thanks,

Mathieu


--
Thanks and Regards,
Prateek

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com