RE: [patch v2 15/30] x86/cpu: Detect real BSP on crash kernels

From: Michael Kelley
Date: Wed Jan 31 2024 - 12:59:23 EST


From: Thomas Gleixner <tglx@xxxxxxxxxxxxx> Sent: Tuesday, January 23, 2024 5:11 AM
>
> When a kdump kernel is started from a crashing CPU then there is no
> guarantee that this CPU is the real boot CPU (BSP). If the kdump kernel
> tries to online the BSP then the INIT sequence will reset the machine.
>
> There is a command line option to prevent this, but in case of nested kdump
> kernels this is wrong.
>
> But that command line option is not required at all because the real
> BSP is enumerated as the first CPU by firmware. Support for the only
> known system which was different (Voyager) got removed long ago.
>
> Detect whether the boot CPU APIC ID is the first APIC ID enumerated by
> the firmware. If the first APIC ID enumerated is not matching the boot
> CPU APIC ID then skip registering it.
>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> ---
> V2: Check for the first enumerated APIC ID (Rui)
> ---

[snip]

> /**
> + * topology_register_apic - Register an APIC in early topology maps
> + * @apic_id: The APIC ID to set up
> + * @acpi_id: The ACPI ID associated to the APIC
> + * @present: True if the corresponding CPU is present
> + */
> +void __init topology_register_apic(u32 apic_id, u32 acpi_id, bool present)
> +{
> + if (apic_id >= MAX_LOCAL_APIC) {
> + pr_err_once("APIC ID %x exceeds kernel limit of: %x\n", apic_id, MAX_LOCAL_APIC - 1);
> + topo_info.nr_rejected_cpus++;
> + return;
> + }
> +
> + /* CPU numbers exhausted? */
> + if (topo_info.nr_assigned_cpus >= nr_cpu_ids) {

I'm seeing a problem here when nr_cpus=1 on the kernel boot line
in an otherwise multiprocessor system. topo_info.nr_assigned_cpus
is statically initialized to "1", so when nr_cpus_ids is "1", this test
is true every time this function is called, including for the first APIC
enumerated from the MADT. The warning message is output once,
and it's correct. But topo_info.nr_rejected_cpus is incremented
when it shouldn't be so the number of rejected CPUs is reported as
the # of CPUs in the system.

The # of rejected CPUs message is just cosmetic. But worse,
check_for_real_bsp() and topo_register_apic() are never called. In
a kdump kernel where the panic'ing CPU is not physical CPU zero,
I don't get any messages from check_for_real_bsp().

Michael

> + pr_warn_once("CPU limit of %d reached. Ignoring further CPUs\n", nr_cpu_ids);
> + topo_info.nr_rejected_cpus++;
> + return;
> + }
> +
> + if (check_for_real_bsp(apic_id))
> + return;
> +
> + topo_register_apic(apic_id, acpi_id, present);
> +}
> +