Re: [PATCH] x86/mce/AMD: Fix partial SMCA bank init when CPU 0 != thread 0

From: Borislav Petkov
Date: Wed Jun 28 2017 - 14:17:16 EST


On Wed, Jun 28, 2017 at 12:44:17PM -0500, Jack Miller wrote:
> SwitchBSP() is part of the UEFI MPServices Protocol which I believe is
> an extension but it is supported by all of the firmwares I've tested
> on.

Damn, that ubiquitous firmware. One day the kernel will be just a
userspace process to the fw.

> In this case, I'm using a bootloader to SwitchBSP() so that hardware
> thread 0 (and thus core 0) can be offlined on AMD hardware
> (cpu0_hotplug unsupported).

Why unsupported?

I remember doing some quick experiments with booting with "cpu0_hotplug"
and being able to offline the BSP. It was a long time ago though.

> This is currently working by passing 'nomce' to the kernel, but
> obviously I'd prefer not to disable it.

Right, nomce is not an optimal setting.

> Actually, with 'nomce' or this patch applied the system seems to chug
> along merrily, no further errors in dmesg, no further BUGs. Linux
> still gets all of the topology correct (i.e. CPU 0's
> core/thread/siblings are correctly identified) so really, aside from
> userspace programs doing naive stuff with CPU affinity (like expecting
> even,odd CPUs to be SMT pairs), I think the overall result here is
> that most threads are interchangeable... except when probing certain
> features like these MCA types.

May I ask what your goal is? Or is it sekrit stuff? physical hotplug
maybe?

> Unfortunately, it doesn't. That value is explicitly set to 0.

Yeah, I see it in smp_store_boot_cpu_info().

So if we had to be really correct, that code there should set the
*actual* CPU index of the BSP and not simply write a 0. It's that

BSP index == 0

assumption I've been talking about.

> Most mechanisms operate around CPU #, which isn't very helpful if the
> BSP was changed under the covers.
>
> Alternatively, we could possibly sidestep the APIC ID uncertainty by
> patching get_smca_bank_info() to fallback on reading the bank
> hwid_mcatype from other online CPUs (it's already using
> rdmsr_safe_on_cpu) if its own hwid_mcatype isn't valid/recognized, but
> that's a more invasive patch.

Yeah, I think there is some distinction whether you read the MSRs on the
BSP and on the other threads. Yazen did that in

5896820e0aa3 ("x86/mce/AMD, EDAC/mce_amd: Define and use tables for known SMCA IP types")

Yazen, why CPU 0? Can we get rid of that check there?

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix ImendÃrffer, Jane Smithard, Graham Norton, HRB 21284 (AG NÃrnberg)
--