Re: [RFC PATCH 1/1] sched: ttwu_queue_cond: perform queued wakeups across different L2 caches

From: K Prateek Nayak
Date: Fri Aug 18 2023 - 02:41:09 EST


Hello Vincent, Mathieu,

On 8/17/2023 9:31 PM, Vincent Guittot wrote:
> On Thu, 17 Aug 2023 at 17:34, Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>
>> Skipping queued wakeups for all logical CPUs sharing an LLC means that
>> on a 192-core AMD EPYC 9654 96-Core Processor system (across 2
>> sockets), groups of 8 cores (16 hardware threads) end up grabbing the
>> runqueue locks of other runqueues within the same group on each
>> wakeup, causing contention on the runqueue locks.
>>
>> Improve this by only considering hardware threads sharing an L2 cache as
>> candidates for skipping use of the queued wakeups.
>>
>> This results in the following benchmark improvements:
>>
>> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>>
>> from 49s to 34s. (30% speedup)
>>
>> And similarly with perf bench:
>>
>> perf bench sched messaging -g 32 -p -t -l 100000
>>
>> from 10.9s to 7.4s (32% speedup)
>>
>> This was developed as part of the investigation into a weird regression
>> reported by AMD where adding a raw spinlock in the scheduler context
>> switch accelerated hackbench. It turned out that replacing this raw
>> spinlock with a loop of 10000x cpu_relax() within do_idle() had similar
>> benefits.
>>
>> This patch achieves a similar effect without busy-waiting or changing
>> anything about runqueue selection on wakeup. It considers that only
>> hardware threads sharing an L2 cache should skip the queued
>> try-to-wakeup and directly grab the target runqueue lock, rather than
>> allowing all hardware threads sharing an LLC to do so.
>>
>> I would be interested to hear feedback about performance impact of this
>> patch (improvement or regression) on other workloads and hardware,
>> especially for Intel CPUs. One thing that we might want to empirically
>> figure out from the topology is whether there is a maximum number of
>> hardware threads within an LLC below which it would make sense to use
>> the LLC rather than L2 as group within which queued wakeups can be
>> skipped.
>>
>> [ Only tested on AMD CPUs so far. ]
>>
>> Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@xxxxxxx
>> Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@xxxxxxxxxxxx/
>> Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@xxxxxxxxxxxx/
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
>> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
>> Cc: Ben Segall <bsegall@xxxxxxxxxx>
>> Cc: Mel Gorman <mgorman@xxxxxxx>
>> Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
>> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>> Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
>> Cc: Swapnil Sapkal <Swapnil.Sapkal@xxxxxxx>
>> Cc: Aaron Lu <aaron.lu@xxxxxxxxx>
>> Cc: x86@xxxxxxxxxx
>> ---
>> arch/Kconfig | 6 ++++++
>> arch/x86/Kconfig | 1 +
>> drivers/base/Kconfig | 1 +
>> include/linux/sched/topology.h | 3 ++-
>> kernel/sched/core.c | 26 +++++++++++++++++++++++---
>> 5 files changed, 33 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index 205fd23e0cad..e5aac1741712 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -340,6 +340,12 @@ config HAVE_ASM_MODVERSIONS
>> <asm/asm-prototypes.h> to support the module versioning for symbols
>> exported from assembly code.
>>
>> +config HAVE_CLUSTERGROUP
>> + bool
>> + help
>> + This symbol should be selected by an architecture if it
>> + implements CPU cluster groups.
>> +
>> config HAVE_REGS_AND_STACK_ACCESS_API
>> bool
>> help
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index cb1031018afa..07813a1a9a58 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -299,6 +299,7 @@ config X86
>> select FUNCTION_ALIGNMENT_4B
>> imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
>> select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>> + select HAVE_CLUSTERGROUP
>>
>> config INSTRUCTION_DECODER
>> def_bool y
>> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
>> index 2b8fd6bb7da0..408aaf7a4bd1 100644
>> --- a/drivers/base/Kconfig
>> +++ b/drivers/base/Kconfig
>> @@ -218,6 +218,7 @@ config DMA_FENCE_TRACE
>>
>> config GENERIC_ARCH_TOPOLOGY
>> bool
>> + select HAVE_CLUSTERGROUP
>> help
>> Enable support for architectures common topology code: e.g., parsing
>> CPU capacity information from DT, usage of such information for
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 816df6cc444e..714386070463 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -178,7 +178,8 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>> cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
>> void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
>>
>> -bool cpus_share_cache(int this_cpu, int that_cpu);
>> +bool cpus_share_cluster(int this_cpu, int that_cpu); /* Share L2. */
>> +bool cpus_share_cache(int this_cpu, int that_cpu); /* Share LLC. */
>
> I think that Yicong is doing what you want with
> cpus_share_lowest_cache() which points to cluster when available or
> LLC otherwise
> https://lore.kernel.org/lkml/20220720081150.22167-1-yangyicong@xxxxxxxxxxxxx/t/#m0ab9fa0fe0c3779b9bbadcfbc1b643dce7cb7618
Please correct me if I'm wrong, but with Yicong's latest version
(https://lore.kernel.org/lkml/20230719092838.2302-2-yangyicong@xxxxxxxxxx/)
"sd_share_id" will follow "sd_llc_id" if the SD_CLUSTER domain is
degenerated, which is the case on any system where the CLUSTER domain is
the same as the SMT domain.

Logging the CPU and "sd_share_id" on my 3rd Generation EPYC system, I see:

CPU(0) sd_share_id(0)
CPU(1) sd_share_id(0)
CPU(2) sd_share_id(0)
CPU(3) sd_share_id(0)
CPU(4) sd_share_id(0)
CPU(5) sd_share_id(0)
CPU(6) sd_share_id(0)
CPU(7) sd_share_id(0)
CPU(8) sd_share_id(8)
CPU(9) sd_share_id(8)
...
CPU(127) sd_share_id(120)
CPU(128) sd_share_id(0)
CPU(129) sd_share_id(0)
CPU(130) sd_share_id(0)
CPU(131) sd_share_id(0)
CPU(132) sd_share_id(0)
CPU(133) sd_share_id(0)
CPU(134) sd_share_id(0)
CPU(135) sd_share_id(0)
CPU(136) sd_share_id(8)
...

"sd_share_id" follows "sd_llc_id" since the CLUSTER domain is
degenerated.

# echo Y > /sys/kernel/debug/sched/verbose
# cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING # SMT Domain
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING # MC Domain
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING # DIE Domain
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA # NUMA Domain

But I believe Mathieu's case would require falling back to "core_id" if
the cluster domain has degenerated.

--
Thanks and Regards,
Prateek