Re: unchecked MSR access error: WRMSR to 0xd84 (tried to write 0x0000000000010003) at rIP: 0xffffffffa025a1b8 (snbep_uncore_msr_init_box+0x38/0x60 [intel_uncore])

From: Liang, Kan
Date: Mon Mar 04 2024 - 14:23:16 EST




On 2024-03-04 1:18 p.m., Borislav Petkov wrote:
> Hi all,
>
> sending this to a bunch of people who have touched this function
> recently and some more relevant Intel folks.
>
> The machine is an old SNB:
>
> smpboot: CPU0: Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz (family: 0x6, model: 0x2d, stepping: 0x7)
>
> and with latest linus/master + tip/master it gives the below.
>
> It must be something new because 6.8-rc6 is fine.
>
> ...
> i801_smbus 0000:00:1f.3: enabling device (0000 -> 0003)
> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input5
> i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
> ACPI: button: Power Button [PWRF]
> i2c i2c-14: 4/4 memory slots populated (from DMI)
> unchecked MSR access error: WRMSR to 0xd84 (tried to write 0x0000000000010003) at rIP: 0xffffffffa025a1b8 (snbep_uncore_msr_init_box+0x38/0x60 [intel_uncore])

0xd84 is the box control MSR of CBOX 4 (the MSR is defined on page 11 of
https://www.intel.com/content/www/us/en/develop/download/intel-xeon-processor-e5-v2-and-e7-v2-product-families-uncore-performance-monitoring.html).

It looks like the driver tries to access CBOX 4, which is not available
on the machine.
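
For illustration, here is a rough sketch of how 0xd84 and the written value
0x10003 decode. It assumes the CBOX 0 box control MSR sits at 0xd04, that
consecutive CBOX register blocks are 0x20 apart, and that the init value is
"reset control + reset counters + freeze enable". Those constants are how
I understand arch/x86/events/intel/uncore_snbep.c and should be checked
against the source; the two helper functions are hypothetical and exist
only for this example.

```c
#include <stdint.h>

/* Assumed values, believed to match uncore_snbep.c; verify before relying on them. */
#define SNBEP_C0_MSR_PMON_BOX_CTL	0xd04u	/* CBOX 0 box control MSR */
#define SNBEP_CBO_MSR_OFFSET		0x20u	/* stride between CBOX register blocks */

#define SNBEP_PMON_BOX_CTL_RST_CTRL	(1u << 0)
#define SNBEP_PMON_BOX_CTL_RST_CTRS	(1u << 1)
#define SNBEP_PMON_BOX_CTL_FRZ_EN	(1u << 16)
#define SNBEP_PMON_BOX_CTL_INT		(SNBEP_PMON_BOX_CTL_RST_CTRL | \
					 SNBEP_PMON_BOX_CTL_RST_CTRS | \
					 SNBEP_PMON_BOX_CTL_FRZ_EN)

/* Hypothetical helper: box control MSR address for CBOX number 'box'. */
static inline uint32_t cbox_box_ctl_msr(unsigned int box)
{
	return SNBEP_C0_MSR_PMON_BOX_CTL + SNBEP_CBO_MSR_OFFSET * box;
}

/* Hypothetical helper: CBOX number for a box control MSR, or -1 if none matches. */
static inline int cbox_index_from_msr(uint32_t msr)
{
	if (msr < SNBEP_C0_MSR_PMON_BOX_CTL)
		return -1;
	if ((msr - SNBEP_C0_MSR_PMON_BOX_CTL) % SNBEP_CBO_MSR_OFFSET)
		return -1;
	return (int)((msr - SNBEP_C0_MSR_PMON_BOX_CTL) / SNBEP_CBO_MSR_OFFSET);
}
```

Under those assumptions, 0xd04 + 4 * 0x20 = 0xd84, i.e. CBOX 4, and the
value being written is the box init value 0x10003.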

The number of available CBOXes on a SNBEP machine is determined at boot
time. It should not be larger than the maximum number of cores.
The recent commit 89b0f15f408f ("x86/cpu/topology: Get rid of
cpuinfo::x86_max_cores") changed boot_cpu_data.x86_max_cores to
topology_num_cores_per_package().
I guess the new function returns a different maximum number of cores on
this machine, but I don't have a SNBEP machine on hand. Could you please
check whether a different maximum number of cores is returned?
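
To make the suspected failure mode concrete, below is a minimal
stand-alone model of the boot-time clamp. It is not the kernel code: the
function names are made up for this example, and the upper bound of 8
CBOXes for SNBEP is an assumption taken from my reading of the driver.
The core counts passed in are example inputs, not values measured on the
reporter's machine.

```c
/* Assumed compile-time maximum number of CBOXes the SNBEP driver describes. */
#define SNBEP_MAX_CBOXES	8u

/*
 * Hypothetical model of the clamp: the driver should never expose more
 * CBOXes than the reported number of cores per package.
 */
static unsigned int clamp_num_cboxes(unsigned int max_boxes,
				     unsigned int cores_per_package)
{
	return max_boxes > cores_per_package ? cores_per_package : max_boxes;
}

/* Hypothetical check: would box 'box' be initialized for this box count? */
static int cbox_is_initialized(unsigned int box, unsigned int num_boxes)
{
	return box < num_boxes;
}
```

With a reported count of 4, clamp_num_cboxes(SNBEP_MAX_CBOXES, 4) gives 4,
so only CBOX 0-3 are initialized and 0xd84 is never written. If the
reported count were 5 or more, CBOX 4 would be initialized as well, which
would match the WRMSR in the report.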

Thanks,
Kan

> Call Trace:
> <TASK>
> ? ex_handler_msr+0xcb/0x130
> ? fixup_exception+0x166/0x320
> ? exc_general_protection+0xd7/0x3f0
> ? asm_exc_general_protection+0x22/0x30
> ? snbep_uncore_msr_init_box+0x38/0x60 [intel_uncore]
> uncore_box_ref.part.0+0x9c/0xc0 [intel_uncore]
> ? __pfx_uncore_event_cpu_online+0x10/0x10 [intel_uncore]
> uncore_event_cpu_online+0x56/0x140 [intel_uncore]
> ? __pfx_uncore_event_cpu_online+0x10/0x10 [intel_uncore]
> cpuhp_invoke_callback+0x174/0x5e0
> ? cpuhp_thread_fun+0x5a/0x200
> cpuhp_thread_fun+0x17e/0x200
> ? smpboot_thread_fn+0x2b/0x250
> smpboot_thread_fn+0x1ad/0x250
> ? __pfx_smpboot_thread_fn+0x10/0x10
> kthread+0xed/0x120
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x30/0x50
> ? __pfx_kthread+0x10/0x10
> iTCO_vendor_support: vendor-support=0
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> iTCO_wdt iTCO_wdt.1.auto: Found a Patsburg TCO device (Version=2, TCOBASE=0x0460)
> iTCO_wdt iTCO_wdt.1.auto: initialized. heartbeat=30 sec (nowayout=0)
> RAPL PMU: API unit is 2^-32 Joules, 2 fixed counters, 163840 ms ovfl timer
> ...
>
> Thx.
>