Re: [RFC PATCH] topology: Represent clusters of CPUs within a die.

From: Jonathan Cameron
Date: Mon Oct 19 2020 - 12:02:21 EST


On Mon, 19 Oct 2020 16:51:06 +0100
Valentin Schneider <valentin.schneider@xxxxxxx> wrote:

> On 19/10/20 15:27, Jonathan Cameron wrote:
> > On Mon, 19 Oct 2020 14:48:02 +0100
> > Valentin Schneider <valentin.schneider@xxxxxxx> wrote:
> >>
> >> That's my cue to paste some of that stuff I've been rambling on and off
> >> about!
> >>
> >> With regard to cache / interconnect layout, I do believe that if we
> >> want to support that in the scheduler itself then we should leverage some
> >> distance table rather than create X extra scheduler topology levels.
> >>
> >> I had a chat with Jeremy on the ACPI side of that some time ago. IIRC given
> >> that SLIT gives us a distance value between any two PXM, we could directly
> >> express core-to-core distance in that table. With that (and if that still
> >> lets us properly discover NUMA node spans), we could let the scheduler
> >> build dynamic NUMA-like topology levels representing the inner quirks of
> >> the cache / interconnect layout.
> >
> > You would rapidly run into the problem SLIT had for numa node description.
> > There is no consistent description of distance and, except in the vaguest
> > sense of 'nearer', it wasn't any use for anything. That is why HMAT
> > came along. It's far from perfect but it is a step up.
> >
>
> I wasn't aware of HMAT; my feeble ACPI knowledge is limited to SRAT / SLIT
> / PPTT, so thanks for pointing this out.
>
> > I can't see how you'd generalize those particular tables to do anything
> > for intercore comms without breaking their use for NUMA, but something
> > a bit similar might work.
> >
>
> Right, there's the issue of still being able to determine NUMA node
> boundaries.

Backwards compatibility will break you there. I'd definitely look at a separate
table. Problem with SLIT etc is that, as static tables, we can't play games
with OSC bits to negotiate what the OS and the firmware both understand.

>
> > A lot of thought has gone in (and meeting time) to try and improve the
> > situation for complex topology around NUMA. Whilst there are differences
> > in representing the internal interconnects and caches it seems like a somewhat
> > similar problem. The issue there is it is really really hard to describe
> > this stuff with enough detail to be useful, but simple enough to be usable.
> >
> > https://lore.kernel.org/linux-mm/20181203233509.20671-1-jglisse@xxxxxxxxxx/
> >
>
> Thanks for the link!
>
> >>
> >> It's mostly pipe dreams for now, but there seems to be more and more
> >> hardware where that would make sense; somewhat recently the PowerPC guys
> >> added something to their arch-specific code in that regard.
> >
> > Pipe dream == something to work on ;)
> >
> > ACPI has a nice code-first model of updating the spec now, so we can discuss
> > this one in public, and propose spec changes only once we have a proven
> > implementation.
> >
>
> FWIW I blabbered about a "generalization" of NUMA domains & distances
> within the scheduler at LPC19 (and have been pasting that occasionally,
> apologies for the broken record):
>
> https://linuxplumbersconf.org/event/4/contributions/484/
>
> I've only pondered about the implementation, but if (big if; also I really
> despise advertising "the one solution that will solve all your issues"
> which this is starting to sound like) it would help I could cobble together
> an RFC leveraging a separate distance table.

It would certainly be interesting.
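
For illustration only, here is a minimal Python sketch of the idea discussed
above: take a distance table (SLIT-like, but per CPU rather than per proximity
domain) and derive one nested topology level per unique distance value. The
table values, the name build_levels() and the four-CPU layout are all made up
for the example; it is not kernel code, nor what any RFC would necessarily
look like.

```python
# Hypothetical per-CPU distance table: 4 CPUs in two clusters of 2.
# 10 = self, 20 = same cluster, 30 = other cluster (values are assumptions).
DIST = [
    [10, 20, 30, 30],
    [20, 10, 30, 30],
    [30, 30, 10, 20],
    [30, 30, 20, 10],
]


def build_levels(dist):
    """Return [(distance, spans)], one entry per unique distance, nearest first.

    At each level the span of a CPU is every CPU within that distance of it;
    identical spans are merged so each group is listed once.
    """
    n = len(dist)
    # Unique off-diagonal distances, nearest first.
    distances = sorted({dist[i][j] for i in range(n) for j in range(n) if i != j})
    levels = []
    for d in distances:
        spans = {frozenset(j for j in range(n) if dist[i][j] <= d) for i in range(n)}
        levels.append((d, sorted(sorted(s) for s in spans)))
    return levels


if __name__ == "__main__":
    for distance, spans in build_levels(DIST):
        print(distance, spans)
```

With the table above this gives two levels: the cluster pairs at distance 20,
then all four CPUs at distance 30. Note that it only works because every
distance is reduced to a single comparable number, which is exactly the
limitation raised for SLIT earlier in the thread.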

>
> It doesn't solve the "funneling cache properties into a single number"
> issue, which as you just pointed out in a parallel email is a separate
> discussion altogether.
>
> > Note I'm not proposing we put the cluster stuff in the scheduler, just
> > provide it as a hint to userspace.
> >
>
> The goal being to tweak tasks' affinities, right? Other than CPU pinning
> and rare cases, IMO if the userspace has to mess around with affinities it
> is due to the failings of the underlying scheduler. Restricted CPU
> affinities are also something the load-balancer struggles with; I have
> fought, and am still fighting, such issues where just a single per-CPU kworker
> waking up at the wrong time can mess up load-balance for quite some time. I
> tend to phrase it as: "if you're rude to the scheduler, it can and will
> respond in kind".
>
> Now yes, it's not the same timescale nor amount of work, but this is
> something the scheduler itself should leverage, not userspace.

Ideally I absolutely agree, but then we get into the games of trying to
classify the types of workload which would benefit. Much like with
NUMA spreading, it is going to be hard to come up with a one true
solution (nice though that would be!)

Not getting regressions with anything in this area is going to be
really tricky.

J


>
> > Jonathan