Re: [RFC PATCH v3 2/3] sched: Introduce cpus_share_l2c

From: Mathieu Desnoyers
Date: Wed Aug 23 2023 - 14:51:48 EST


On 8/23/23 11:26, Mathieu Desnoyers wrote:
On 8/22/23 07:31, Mathieu Desnoyers wrote:
Introduce cpus_share_l2c to allow querying whether two logical CPUs
share a common L2 cache.

Considering a system like the AMD EPYC 9654 96-Core Processor, the L1
cache has a latency of 4-5 cycles, the L2 cache has a latency of at
least 14ns, whereas the L3 cache has a latency of 50ns [1]. Compared to
this, I measured RAM access latency at around 120ns on my
system [2]. So L3 really is only 2.4x faster than RAM accesses.
Therefore, with this relatively slow access speed compared to L2, the
scheduler will benefit from only considering CPUs sharing an L2 cache
for the purpose of using remote runqueue locking rather than queued
wakeups.

So I did some more benchmarking to figure out whether the reason for
this speedup is the latency delta between L2 and L3, or the number of
hw threads contending on the rq locks.

I tried to force grouping of those "skip ttwu queue" groups by a subset of the LLC id, basically by taking the LLC id and adding the cpu number modulo N, where N is chosen based on my machine topology.
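
As a rough illustration of the grouping described above, here is a
minimal user-space sketch. It is not the kernel patch: the topology
(contiguous CPU numbering, 16 hw threads per LLC, LLC id equal to the
first CPU of the LLC) and the names llc_id_of(), ttwu_group_id() and
ttwu_group_size() are assumptions made up for this example.

```c
#include <assert.h>

/* Hypothetical topology: CPUs numbered contiguously, 16 hw threads per LLC. */
#define HW_THREADS_PER_LLC 16

/* Assumption: the LLC id is the number of the first CPU in that LLC. */
static int llc_id_of(int cpu)
{
	return cpu - (cpu % HW_THREADS_PER_LLC);
}

/* Group id as described above: LLC id plus (cpu modulo n). */
static int ttwu_group_id(int cpu, int n)
{
	/* n must not exceed the LLC size, or ids from two LLCs could collide. */
	assert(n > 0 && n <= HW_THREADS_PER_LLC);
	return llc_id_of(cpu) + (cpu % n);
}

/* Count the CPUs in [0, nr_cpus) that land in the same group as 'cpu'. */
static int ttwu_group_size(int cpu, int n, int nr_cpus)
{
	int i, count = 0;

	for (i = 0; i < nr_cpus; i++)
		if (ttwu_group_id(i, n) == ttwu_group_id(cpu, n))
			count++;
	return count;
}
```

Under these assumptions, a larger N splits each LLC into more, smaller
groups: N = 2 yields groups of 8 hw threads, N = 4 groups of 4, and
N = 8 groups of 2.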

The end result is that I have similar numbers for groups of 1, 2, 4 HW
threads (which use rq locks and skip queued ttwu within the group).
Starting with groups of size 8, the performance starts to degrade.

So I wonder: do machines with more than 4 HW threads per L2 cache
exist? If so, we should think about grouping not only by L2 cache, but
also sub-dividing this group so the number of hw threads per group is
at most 4.
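
One possible shape for such a sub-division, purely as a sketch: split
each L2 domain into chunks of at most 4 hw threads. The function name
capped_group_id(), its parameters and the MAX_GROUP_THREADS constant
are hypothetical and do not come from the patch series.

```c
#include <assert.h>

/* Hypothetical cap on hw threads per group, taken from the numbers above. */
#define MAX_GROUP_THREADS 4

/*
 * Sub-divide an L2 domain. 'l2_index' numbers the L2 caches (0, 1, ...),
 * 'pos_in_l2' is the 0-based position of the CPU within its L2, and
 * 'threads_per_l2' is the number of hw threads sharing that L2.
 */
static int capped_group_id(int l2_index, int pos_in_l2, int threads_per_l2)
{
	int subgroups;

	assert(threads_per_l2 > 0);
	assert(pos_in_l2 >= 0 && pos_in_l2 < threads_per_l2);

	/* Number of sub-groups per L2, rounded up. */
	subgroups = (threads_per_l2 + MAX_GROUP_THREADS - 1) / MAX_GROUP_THREADS;

	/* Consecutive chunks of MAX_GROUP_THREADS CPUs share one group id. */
	return l2_index * subgroups + pos_in_l2 / MAX_GROUP_THREADS;
}
```

With 2 hw threads per L2 this degenerates to plain L2 grouping; with a
hypothetical 8 hw threads per L2, each L2 would be split into two
groups of 4.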

Here are my results with the hackbench test-case:

- group cpus by 16 hw threads:

Time: 49s

- group cpus by 8 hw threads: (llc_id + cpu modulo 2)

Time: 39s

- group cpus by 4 hw threads: (llc_id + cpu modulo 4)

Time: 34s

- group cpus by 2 hw threads: (llc_id + cpu modulo 8)
(expect same as L2 grouping on this machine)

Time: 34s

- group cpus by 1 hw thread: (cpu)

Time: 33s

One more interesting data point: I tried modifying the grouping
so that I would explicitly group by hw threads which sit in different
L3, and even on different NUMA nodes for some
(group id = cpu_id % 192). This is expected to generate really _bad_
cache locality for the runqueue locks within a group.

The result for these groups of 3 HW threads is about 33s with the
hackbench benchmark, which seems to confirm that the cause of the
speedup is reduced contention on the rq locks from making the groups
smaller, rather than improved cache locality from L3 to L2.

So grouping by shared L2 only happens to make the group size OK, but
this benchmark does not significantly benefit from having all runqueue
locks on the same L2.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com