Re: [PATCH] x86/mce/AMD: Fix partial SMCA bank init when CPU 0 != thread 0

From: Jack Miller
Date: Wed Jun 28 2017 - 13:44:26 EST


On Wed, Jun 28, 2017 at 4:22 AM, Borislav Petkov <bp@xxxxxxx> wrote:
> On Tue, Jun 27, 2017 at 07:06:30PM -0500, Jack Miller wrote:
>> After a call to firmware SwitchBSP(),
>
> What is that and who does that?

SwitchBSP() is part of the UEFI MPServices Protocol which I believe is
an extension but it is supported by all of the firmwares I've tested
on.

In this case, I'm using a bootloader to SwitchBSP() so that hardware
thread 0 (and thus core 0) can be offlined on AMD hardware
(cpu0_hotplug unsupported). This is currently working by passing
'nomce' to the kernel, but obviously I'd prefer not to disable it.

>
>> Linux can be booted with a thread
>> that isn't the first in the system. That thread automatically becomes
>> CPU 0.
>
> Btw, you should be seeing other explosions too as a lot of code assumes
> CPU 0 is the BSP.

Actually, with 'nomce' or this patch applied the system seems to chug
along merrily, no further errors in dmesg, no further BUGs. Linux
still gets all of the topology correct (i.e. CPU 0's
core/thread/siblings are correctly identified) so really, aside from
userspace programs doing naive stuff with CPU affinity (like expecting
even,odd CPUs to be SMT pairs), I think the overall result here is
that most threads are interchangeable... except when probing certain
features like these MCA types.

>
> ...
>
>> Signed-off-by: Jack Miller <jack@xxxxxxxxxxx>
>> ---
>> arch/x86/kernel/cpu/mcheck/mce_amd.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
>> index 6e4a047e4b68..9d74adcf34d2 100644
>> --- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
>> +++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
>> @@ -170,8 +170,8 @@ static void get_smca_bank_info(unsigned int bank)
>> struct smca_hwid *s_hwid;
>> u32 high, instance_id;
>>
>> - /* Collect bank_info using CPU 0 for now. */
>> - if (cpu)
>> + /* Collect bank_info using hardware thread 0 for now. */
>> + if (apic->get_apic_id(apic->read(APIC_ID)) != 0)
>
> Does
>
> if (cpu != boot_cpu_data.cpu_index)
> return;
>
> work?

Unfortunately, it doesn't. That value is explicitly set to 0. Most
mechanisms operate around CPU #, which isn't very helpful if the BSP
was changed under the covers.

Alternatively, we could possibly sidestep the APIC ID uncertainty by
patching get_smca_bank_info() to fallback on reading the bank
hwid_mcatype from other online CPUs (it's already using
rdmsr_safe_on_cpu) if its own hwid_mcatype isn't valid/recognized, but
that's a more invasive patch.

Jack