Re: [PATCH v2 6/7] sched: Shard per-LLC shared runqueues

From: Gautham R. Shenoy
Date: Wed Jul 12 2023 - 06:06:56 EST


On Tue, Jul 11, 2023 at 02:57:57PM -0500, David Vernet wrote:
> On Tue, Jul 11, 2023 at 12:49:58PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 10, 2023 at 03:03:41PM -0500, David Vernet wrote:

[..snip..]

> > > +static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
> > > +{
> > > + return cpu % runq->num_shards;
> >
> > I would suggest either:
> >
> > (cpu >> 1) % num_shards
> >
> > or keeping num_shards even, to give SMT siblings a fighting chance to
> > hit the same bucket.
>
> Given that neither of these approaches guarantees that the SMT siblings
> are in the same bucket, I'll just go with your suggestion which is
> simpler.
>
> Seems inevitable that we'll want to have another debugfs knob to adjust
> the number of shards, but IMO it's preferable to just apply your
> suggestion in v3 and hold off on adding that complexity until we know we
> need it.
>
> > (I've no idea how SMT4 (or worse SMT8) is typically enumerated, so
> > someone from the Power/Sparc/MIPS world would have to go play with that
> > if they so care)
>
> Yeah, no idea either. If these things end up varying a lot across
> different architectures then we can look into making shard assignment
> architecture specific.

On POWER, the SMT siblings are enumerated sequentially, i.e.,

CPU id of a thread = Core_id * threads_per_core + thread_id_within_core.
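
To make that concrete, a throwaway illustration (the helper name and the
threads_per_core parameter are mine, not a kernel interface):

/* POWER-style sequential enumeration: the threads of a core are
 * contiguous. Illustrative helper only. */
static inline int seq_cpu_id(int core_id, int thread_id, int threads_per_core)
{
        return core_id * threads_per_core + thread_id;
}

/* e.g. with SMT8, core 2 owns CPUs 16..23: seq_cpu_id(2, 0, 8) == 16 */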

But IIRC, POWER sets the L2 domain as the LLC. On POWER8 (with SMT8) and
POWER9 (with SMT4 on Baremetal and SMT8 on VMs), the LLC size is 8. Even
with SHARED_RUNQ_SHARD_SZ = 6, there will only be 1 shard with the
current formula:

num_shards = max(per_cpu(sd_llc_size, i)/SHARED_RUNQ_SHARD_SZ, 1);

(Aside: with the above formula, on a topology with 6 < sd_llc_size <
12, num_shards will remain 1, with the shard size exceeding the
intended SHARD_SZ. Was this the intention?)
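
To spell that out (userspace illustration only; SHARD_SZ = 6 is the
value assumed throughout this mail):

#include <stdio.h>

#define SHARD_SZ 6

/* Same arithmetic as the quoted formula: max(llc_size / SHARD_SZ, 1) */
static unsigned int num_shards(unsigned int llc_size)
{
        unsigned int n = llc_size / SHARD_SZ;

        return n ? n : 1;
}

int main(void)
{
        /* llc=8 -> 1 shard of 8, llc=11 -> 1 shard of 11, llc=16 -> 2 shards of 8 */
        printf("%u %u %u\n", num_shards(8), num_shards(11), num_shards(16));
        return 0;
}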

Even on x86, there is no uniformity in how the SMT threads are
numbered. On AMD EPYC Baremetal, the first threads of all the cores
are enumerated first, followed by their sibling threads. So, on an EPYC
server with 128 cores in total, the SMT siblings are {0, 128}, {1, 129}, ...

With SHARED_RUNQ_SHARD_SZ = 6,

On Zen2 EPYC Baremetal, with LLC size = 8, num_shards = 1. This
simplifies stuff!

On Zen3, Zen4 EPYC Baremetal, with LLC size = 16, num_shards = 2.

Here, (cpu % num_shards) ensures that the SMT siblings of a core belong
to the same shard, along with 3 other cores.
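
A quick way to see it (shard indexing assumed to be the quoted
cpu % num_shards):

/* On Zen3/Zen4 the sibling of CPU c is c + 128, and num_shards == 2.
 * Since 128 is a multiple of 2, c % 2 == (c + 128) % 2 for every c,
 * so both siblings always hash to the same shard. */
static int same_shard(int cpu, int sibling, int num_shards)
{
        return (cpu % num_shards) == (sibling % num_shards);
}

/* same_shard(1, 1 + 128, 2) == 1 */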

On some Intel servers, the CPU numbers can be interleaved across the
two sockets. On my 2-socket, 32-cores-per-socket Ice Lake server, all
the even-numbered CPUs are in one socket and all the odd-numbered CPUs
are in the other.

The SMT siblings are {0,64}, {2, 66}, .... on one socket and {1, 65},
{3, 67}, .. on the other.

On this system, LLC size = 64. With SHARED_RUNQ_SHARD_SZ = 6,
num_shards = 10.

So with (cpu % num_shards) the siblings {0, 64} ... will belong to
different shards.
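
Plugging the numbers into the same check as above:

/* Ice Lake: sibling of CPU 0 is CPU 64, num_shards == 10.
 *   0 % 10 == 0,  64 % 10 == 4   =>  same_shard(0, 64, 10) == 0
 * i.e. the pair is split across shards. */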

What would be good to have is:

1. shard_size determined by individual architectures. If none is
provided, we pick the default shard_size.

2. A sharding scheme which guarantees that SMT siblings will belong
to the same shard as long as shard_size is at least as big as the SMT
size (a rough sketch follows below).
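
For (2), one possibility (a rough sketch, not what the patch does;
it assumes core ids within an LLC are dense enough for the modulo to
spread cores evenly) would be to key the shard on the core rather than
the raw CPU number:

static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
{
        /* All SMT siblings of a core share topology_core_id(), so they
         * land in the same shard regardless of how the architecture
         * enumerates its CPUs. */
        return topology_core_id(cpu) % runq->num_shards;
}

That would keep Peter's goal of SMT siblings hitting the same bucket
without depending on the enumeration scheme, at the cost of a topology
lookup in the hot path (which could be cached when the domains are built).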

--
Thanks and Regards
gautham.