Re: unchecked MSR access error: WRMSR to 0xd84 (tried to write 0x0000000000010003) at rIP: 0xffffffffa025a1b8 (snbep_uncore_msr_init_box+0x38/0x60 [intel_uncore])

From: Borislav Petkov
Date: Wed Mar 06 2024 - 07:33:16 EST


On Wed, Mar 06, 2024 at 12:17:02PM +0100, Thomas Gleixner wrote:
> On Tue, Mar 05 2024 at 13:10, Borislav Petkov wrote:
> > I guess ship it but we'll pay attention to what else ends up
> > complaining.
>
> Here is an updated version which handles it in the topology core code so
> that MPPARSE is covered as well.
>
> Thanks,
>
> tglx
> ---
> Subject: x86/topology: Ignore non-present APIC IDs in a present package
> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Date: Tue, 05 Mar 2024 10:57:26 +0100
>
> Borislav reported that one of his systems has a broken MADT table which
> advertises eight present APICs and 24 non-present APICs in the same
> package.
>
> The non-present ones are considered hot-pluggable by the topology
> evaluation code, which is obviously bogus as there is no way to hot-plug
> within the same package.
>
> As the topology evaluation code accounts for hot-pluggable CPUs in a
> package, the maximum number of cores per package is computed wrong, which
> in turn causes the uncore performance counter driver to access non-existing
> MSRs. It will probably confuse other entities which rely on the maximum
> number of cores and threads per package too.
>
> Cure this by ignoring hot-pluggable APIC IDs within a present package.
>
> In theory it would be reasonable to just do this unconditionally, but then
> there is this thing called reality^Wvirtualization which ruins
> everything. Virtualization is the only existing user of "physical" hotplug
> and the virtualization tools allow the above scenario. Whether that is
> actually in use or not is unknown.
>
> As it can be argued that the virtualization case is not affected by the
> issues which exposed the reported problem, allow the bogosity if the kernel
> determined that it is running in a VM for now.
>
> Reported-by: Borislav Petkov (AMD) <bp@xxxxxxxxx>
> Fixes: 89b0f15f408f ("x86/cpu/topology: Get rid of cpuinfo::x86_max_cores")
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> ---
> arch/x86/kernel/cpu/topology.c | 38 +++++++++++++++++++++++++++++---------
> 1 file changed, 29 insertions(+), 9 deletions(-)
>
> --- a/arch/x86/kernel/cpu/topology.c
> +++ b/arch/x86/kernel/cpu/topology.c

This needs

#include <asm/hypervisor.h>

at the top here.

With that, relevant new lines from dmesg:

+CPU topo: Ignoring hot-pluggable APIC ID 8 in present package.

and

@@ -129,9 +130,10 @@ CPU topo: Max. logical packages: 1
CPU topo: Max. logical dies: 1
CPU topo: Max. dies per package: 1
CPU topo: Max. threads per core: 2
-CPU topo: Num. cores per package: 16
-CPU topo: Num. threads per package: 32
-CPU topo: Allowing 8 present CPUs plus 24 hotplug CPUs
+CPU topo: Num. cores per package: 4
+CPU topo: Num. threads per package: 8
+CPU topo: Allowing 8 present CPUs plus 0 hotplug CPUs
+CPU topo: Rejected CPUs 24
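
To illustrate the accounting behind those numbers, here is a minimal
userspace sketch in C. It is not the code from the patch: struct apic_entry,
struct topo_result, evaluate() and cores_per_package() are hypothetical
names, the two-pass scan is a simplification, the in_vm flag merely stands
in for the kernel's hypervisor detection, and a single package is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PKGS 64	/* arbitrary limit for this sketch */

/* Hypothetical, simplified view of one enumerated APIC (not a kernel struct) */
struct apic_entry {
	unsigned int apic_id;
	unsigned int pkg_id;
	bool present;
};

/* Counters mirroring the "present / hotplug / rejected" dmesg lines */
struct topo_result {
	unsigned int present;
	unsigned int hotplug;
	unsigned int rejected;
};

static struct topo_result evaluate(const struct apic_entry *e, size_t n, bool in_vm)
{
	bool pkg_present[MAX_PKGS] = { false };
	struct topo_result r = { 0, 0, 0 };
	size_t i;

	/* Pass 1: remember every package which has at least one present APIC */
	for (i = 0; i < n; i++) {
		assert(e[i].pkg_id < MAX_PKGS);
		if (e[i].present)
			pkg_present[e[i].pkg_id] = true;
	}

	/*
	 * Pass 2: a non-present APIC in a present package is rejected on
	 * bare metal, but still accounted as hot-pluggable in a VM.
	 */
	for (i = 0; i < n; i++) {
		if (e[i].present)
			r.present++;
		else if (pkg_present[e[i].pkg_id] && !in_vm)
			r.rejected++;
		else
			r.hotplug++;
	}
	return r;
}

/* Single-package assumption: all accounted threads share one package */
static unsigned int cores_per_package(struct topo_result r, unsigned int threads_per_core)
{
	assert(threads_per_core > 0);
	return (r.present + r.hotplug) / threads_per_core;
}
```

Fed with 8 present and 24 non-present entries in package 0 and in_vm set to
false, this counts 8 present, 0 hotplug and 24 rejected, i.e. 4 cores per
package at 2 threads per core. With in_vm set to true the 24 entries stay
hot-pluggable and the result is 16 cores per package, which matches the old
numbers.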

AFAIC, ship it.

Tested-by: Borislav Petkov (AMD) <bp@xxxxxxxxx>

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette