Re: [RFC] x86/mce: Add workaround for SKX/CLX/CPX spurious machine checks

From: Luck, Tony
Date: Mon Feb 07 2022 - 13:32:43 EST


On Sun, Feb 06, 2022 at 08:36:40PM -0800, Jue Wang wrote:
> +static bool quirk_skylake_repmov(void)
> +{
> +	/*
> +	 * State that represents if an SRAR MCE has already signaled on the DCU bank.
> +	 */
> +	static DEFINE_PER_CPU(bool, srar_dcu_signaled);
> +
> +	if (unlikely(!__this_cpu_read(srar_dcu_signaled))) {
> +		u64 mc1_status = mce_rdmsrl(MSR_IA32_MCx_STATUS(1));

Jue,

When I reviewed this for you off-list, I didn't notice that you
dropped the test for mcgstatus & MCG_STATUS_LMCES as part of
moving to a helper function and expanding the test for more
bits in mc1_status.

I think that test is still important ... knowing that this is
a *local* machine check before making a decision based on just what this
CPU observes makes this a bit more robust.
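
As a rough illustration of the ordering being asked for, here is a
userspace model, not kernel code. MCG_STATUS_LMCES is bit 3 of
IA32_MCG_STATUS as in the kernel headers; quirk_applies(), its
bank1_is_srar parameter and the plain static flag are hypothetical
stand-ins for the helper, the is_intel_srar(mc1_status) result and the
per-CPU variable in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 3 of IA32_MCG_STATUS: the machine check was signaled locally. */
#define MCG_STATUS_LMCES (1ULL << 3)

/* Hypothetical stand-in for the per-CPU srar_dcu_signaled flag. */
static bool srar_dcu_signaled;

/*
 * Hypothetical model of the quirk decision. bank1_is_srar stands in for
 * the result of is_intel_srar(mc1_status) in the patch.
 */
static bool quirk_applies(uint64_t mcgstatus, bool bank1_is_srar)
{
	/* Only trust this CPU's own view when the machine check is local. */
	if (!(mcgstatus & MCG_STATUS_LMCES))
		return false;

	/* The workaround is applied at most once. */
	if (srar_dcu_signaled)
		return false;

	if (!bank1_is_srar)
		return false;

	srar_dcu_signaled = true;
	return true;
}
```

With the LMCES test first, a broadcast machine check never reaches the
bank 1 decision, so the flag stays clear until a local SRAR event shows up.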

> +
> +		if (is_intel_srar(mc1_status)) {
> +			__this_cpu_write(srar_dcu_signaled, true);
> +			msr_clear_bit(MSR_IA32_MISC_ENABLE,
> +				      MSR_IA32_MISC_ENABLE_FAST_STRING_BIT);
> +			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> +			mce_wrmsrl(MSR_IA32_MCx_STATUS(1), 0);
> +			pr_err("First SRAR MCE on DCU, CPU: %d, disable fast string copy.\n",
> +			       smp_processor_id());
> +			return true;
> +		}
> +	}
> +	return false;
> +}

-Tony