RE: [patch v2 21/38] x86/cpu: Provide cpu_init/parse_topology()

From: Michael Kelley (LINUX)
Date: Mon Jul 31 2023 - 12:26:11 EST


From: Thomas Gleixner <tglx@xxxxxxxxxxxxx> Sent: Monday, July 31, 2023 5:35 AM
>
> On Mon, Jul 31 2023 at 04:05, Michael Kelley wrote:
> >> + /*
> >> + * The initial invocation from early_identify_cpu() happens before
> >> + * the APIC is mapped or X2APIC enabled. For establishing the
> >> + * topology, that's not required. Use the initial APIC ID.
> >> + */
> >> + if (early)
> >> + c->topo.apicid = c->topo.initial_apicid;
> >> + else
> >> + c->topo.apicid = read_apic_id();
> >
> > Using the value from the local APIC ID reg turns out to cause a problem in
> > some Hyper-V VM configurations. If a VM has multiple L3 caches (probably
> > due to multiple NUMA nodes) and the # of CPUs in the span of the L3 cache
> > is not a power of 2, the APIC IDs for the CPUs in the span of the 1st L3 cache
> > are sequential starting with 0. But then there is a gap before starting the
> > APIC IDs for the CPUs in the span of the 2nd L3 cache. The gap is
> > repeated if there are additional L3 caches.
> >
> > The CPUID instruction executed on a guest vCPU correctly reports the APIC
> > IDs. However, the ACPI MADT assigns the APIC IDs sequentially with no
> > gaps, and the guest firmware sets the APIC_ID register for each local APIC
> > to match the MADT. When parse_topology() sets the apicid field based on
> > reading the local APIC ID register, the value it sets is different from the
> > initial_apicid value for CPUs in the span of the 2nd and subsequent L3
> > caches, because there's no gap in the APIC IDs read from the local APIC.
> > Linux boots and runs, but the topology is set up with the wrong span for
> > the L3 cache and for the associated scheduling domains.
>
> TBH. That's an insanity. MADT and the actual APIC ID determine the
> topology. So the gaps should be reflected in MADT and the actual APIC
> IDs should be set correctly if the intent is to provide topology
> information.
>
> Just for the record. This hack works only on Intel today, because AMD
> init sets topo.apicid = read_apic_id() unconditionally. So this is
> inconsistent already, no?
>

Correct. But given that the L3 cache span in the AMD Zen1 and Zen2
processors is only 8 CPUs, there's much less reason to configure a VM
that only uses some of the CPUs in an L3 cache span. Hyper-V does
the APIC ID numbering correctly for Zen3 with its 16 CPUs in the L3
cache span.

> > The old code derives the apicid from the initial_apicid via the
> > phys_pkg_id() callback, so these bad Hyper-V VM configs skate by. The
> > wrong value in the local APIC ID register and MADT does not affect
> > anything, except that the check in validate_apic_and_package_id() fails
> > during boot, and a set of "Firmware bug:" messages is correctly
> > output.
>
> So instead of fixing the firmware bugs, hyper-v just moves on and
> pretends that everything works fine, right?

What can I say. :-(

>
> > Three thoughts:
> >
> > 1) Are Hyper-V VMs the only place where the local APIC ID register might
> > have a bogus value? Probably so, but you never know what might crawl
> > out.
>
> Define bogus. MADT is the primary source of information because that's
> how we know how many CPUs (APICs) are there and what their APIC ID is
> which we can use to wake them up. So there is a reasonable expectation
> that this information is consistent with the rest of the system.

Commit d49597fd3bc7 "x86/cpu: Deal with broken firmware (VMWare/Xen)"
mentions VMware and XEN implementations that violate the spec. The
commit is from late 2016. Have these bad systems aged out and no longer
need accommodation?

>
> The Intel SDM clearly says in Vol 3A section 9.4.5 Identifying Logical
> Processors in an MP System:
>
> "After the BIOS has completed the MP initialization protocol, each
> logical processor can be uniquely identified by its local APIC
> ID. Software can access these APIC IDs in either of the following
> ways:"
>
> These ways include read from APIC, read MADT, read CPUID and implies
> that this must be consistent. For X2APIC it's actually written out:
>
> "If the local APIC unit supports x2APIC and is operating in x2APIC
> mode, 32-bit APIC ID can be read by executing a RDMSR instruction to
> read the processor’s x2APIC ID register. This method is equivalent to
> executing CPUID leaf 0BH described below."
>
> AMD has not been following that in the early 64bit systems as they moved
> the APIC ID space to start at 32 for the first CPU in the first socket
> for whatever reasons. But since then the kernel reads back the APIC ID
> on AMD systems into topo.apicid. But that was long ago and can easily be
> dealt with because at least the real APIC ID and the MADT/MPTABLE
> entries are consistent.
>
> Hypervisors have their own CPUID space to override functionality with
> their own magic stuff, but imposing their nutbolt ideas on the
> architectural part of the system is not only wrong, it's disrespectful
> against the OS developers who try to keep their system sane.
>
> > 2) The natural response is "Well, fix Hyper-V!" I first had this conversation
> > with the Hyper-V team about 5 years ago. Some cases of the problem were
> > fixed, but some cases remain unfixed. It's a long story.
> >
> > 3) Since Hyper-V code in Linux already has an override for the apic->read()
> > function, it's possible to do a hack in that override so that apicid gets set to
> > the same value as initial_apicid, which matches the old code. Here's the diff:
>
> This collides massively with the other work I'm doing, which uses the
> MADT provided information to actually evaluate various topology related
> things upfront and later during bringup. Thats badly needed because lots
> of todays infrastructure is based on heuristics and guesswork.

Fair enough. And I've re-raised the issue with the Hyper-V team.

>
> But it seems I wasted a month on reworking all of this just to be
> stopped cold in the tracks by completely undocumented and unnecessary
> hyper-v abuse.
>
> So if Hyper-V insists on abusing the initial APIC ID as read from CPUID
> for topology information related to L3, then hyper-v should override the
> cache topology mechanism and not impose this insanity on the basic
> topology evaluation infrastructure.
>
> Yours seriously grumpy
>
> tglx