Re: [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC

From: Dave Hansen
Date: Tue Nov 07 2017 - 11:22:25 EST


On 11/07/2017 12:30 AM, Peter Zijlstra wrote:
> On Mon, Nov 06, 2017 at 02:15:00PM -0800, Dave Hansen wrote:
>
>> But, the CPUID for the SNC configuration discussed above enumerates
>> the LLC as being shared by the entire package. This is not 100%
>> precise because the entire cache is not usable by all accesses. But,
>> it *is* the way the hardware enumerates itself, and this is not likely
>> to change.
>
> So CPUID and SRAT will remain inconsistent; even in future products?
> That would absolutely blow chunks.

It certainly isn't ideal as it stands. If it were changed, what would it
be changed to? You cannot even represent the current L3 topology in
CPUID, at least not precisely.
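
To illustrate why CPUID cannot describe this precisely, here is a rough
userspace sketch of how cache sharing is decoded from CPUID leaf 4. The
field layout (cache level in EAX bits 7:5, "IDs sharing this cache minus
one" in EAX bits 25:14) is the architectural one. The helper names and the
sample EAX value are made up for illustration, and this is not the kernel's
actual code. The point is that the leaf only yields one count, which turns
into an aligned range of APIC IDs, so there is no way to say "these CPUs
share the L3, but each half mostly uses its own slice".

```c
#include <stdint.h>

/* CPUID.(EAX=4):EAX bits 7:5 give the cache level (3 for an L3). */
static unsigned int cache_level(uint32_t eax)
{
	return (eax >> 5) & 0x7;
}

/* CPUID.(EAX=4):EAX bits 25:14 give (number of IDs sharing the cache - 1). */
static unsigned int cache_sharing_ids(uint32_t eax)
{
	return ((eax >> 14) & 0xfff) + 1;
}

/* Smallest order such that (1 << order) >= count. */
static unsigned int count_order(unsigned int count)
{
	unsigned int order = 0;

	while ((1u << order) < count)
		order++;
	return order;
}

/*
 * Hypothetical helper: derive a cache-domain ID by dropping the low APIC ID
 * bits covered by the sharing count. Every APIC ID in the same aligned range
 * lands in the same domain, whatever NUMA node it sits in.
 */
static uint32_t llc_domain(uint32_t apicid, uint32_t eax)
{
	return apicid >> count_order(cache_sharing_ids(eax));
}
```

With a made-up EAX describing an L3 shared by 64 IDs, APIC IDs 0 and 63
both map to domain 0 and APIC ID 64 maps to domain 1, no matter how SNC
splits the package into nodes.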

I've been arguing we should optimize the CPUID information for
performance. Right now, it's suboptimal for folks doing NUMA-local
allocations, and I think that's exactly the group of folks that needs
precise information. I'm trying to get it changed going forward.

> If that is the case, we'd best use a fake feature like
> X86_BUG_TOPOLOGY_BROKEN and use that instead of an ever growing list of
> models in this code.

FWIW, I don't consider the current situation broken. Nobody ever
promised the kernel that a NUMA node would never happen inside a socket,
or inside a cache boundary enumerated in CPUID.

The assumptions the kernel made were sane, but the CPU's description of
itself *and* the BIOS-provided information are also sane. But, the
world changed, some of those assumptions turned out to be wrong, and
somebody needs to adjust.

...
>> + if (!topology_same_node(c, o) &&
>> + (c->x86_model == INTEL_FAM6_SKYLAKE_X)) {
>
> This needs a c->x86_vendor test; imagine the fun when AMD releases a
> part with model == SKX ...

Yup, will do.
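
Roughly, the check with the vendor test added might look like the
following. This is a standalone userspace mock, not the actual patch:
struct mock_cpu, the MOCK_* constants, mock_match_llc() and the
numa_in_package flag are stand-ins for the kernel's cpuinfo_x86,
X86_VENDOR_INTEL, INTEL_FAM6_SKYLAKE_X, match_llc() and
x86_has_numa_in_package. Returning false for other parts whose nodes
differ is an assumption of this sketch.

```c
#include <stdbool.h>

/* Mock values standing in for the kernel's vendor and model constants. */
#define MOCK_VENDOR_INTEL	0
#define MOCK_VENDOR_AMD		2
#define MOCK_MODEL_SKYLAKE_X	0x55

/* Minimal stand-in for the per-CPU fields the check looks at. */
struct mock_cpu {
	int vendor;
	int model;
	int node;	/* NUMA node the CPU belongs to */
	int llc_id;	/* LLC domain as enumerated by CPUID */
};

/* Stand-in for x86_has_numa_in_package. */
static bool numa_in_package;

static bool mock_match_llc(const struct mock_cpu *c, const struct mock_cpu *o)
{
	/* Different enumerated LLC: never a match. */
	if (c->llc_id != o->llc_id)
		return false;

	/* Same LLC and same node: the ordinary case. */
	if (c->node == o->node)
		return true;

	/*
	 * Same LLC but different nodes. Only accept this on the one part
	 * known to enumerate itself that way, and test the vendor so that
	 * another vendor's part with the same model number is not caught.
	 */
	if (c->vendor == MOCK_VENDOR_INTEL &&
	    c->model == MOCK_MODEL_SKYLAKE_X) {
		/* Use NUMA instead of coregroups for scheduling. */
		numa_in_package = true;
		return true;
	}

	/* Assumption of this sketch: anything else is treated as no match. */
	return false;
}
```

In this mock, two Intel model 0x55 CPUs on different nodes with the same
LLC ID match and set the flag, while a non-Intel part that happens to have
model 0x55 does not.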

>> + /* Use NUMA instead of coregroups for scheduling: */
>> + x86_has_numa_in_package = true;
>> +
>> + /*
>> + * Now, tell the truth, that the LLC matches. But,
>> + * note that throwing away coregroups for
>> + * scheduling means this will have no actual effect.
>> + */
>> + return true;
>
> What are the ramifications here? Is anybody else using that cpumask
> outside of the scheduler topology setup?

I looked for it and didn't see anything else. I'll double check that
nothing has popped up since I hacked this together.