Re: x86/mce: Is mce_is_memory_error() incorrect for Intel?

From: Sohil Mehta
Date: Wed Dec 13 2023 - 14:28:15 EST


On 12/6/2023 7:08 AM, Sohil Mehta wrote:

In an effort to rewrite the below Intel specific comment in
mce_is_memory_error(), I think I may have found an issue. It would be
really helpful if someone can help check the analysis below.

> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 7b397370b4d6..d42122b1afea 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -482,7 +482,8 @@ bool mce_is_memory_error(struct mce *m)
> case X86_VENDOR_INTEL:
> case X86_VENDOR_ZHAOXIN:
> /*
> - * Intel SDM Volume 3B - 15.9.2 Compound Error Codes
> + * Intel SDM: Machine-Check Architecture -> "Compound Error
> + * Codes"
> *
> * Bit 7 of the MCACOD field of IA32_MCi_STATUS is used for
> * indicating a memory error. Bit 8 is used for indicating a


The full comment reads as follows:

* Bit 7 of the MCACOD field of IA32_MCi_STATUS is used for
* indicating a memory error. Bit 8 is used for indicating a
* cache hierarchy error. The combination of bit 2 and bit 3
* is used for indicating a `generic' cache hierarchy error
* But we can't just blindly check the above bits, because if
* bit 11 is set, then it is a bus/interconnect error - and
* either way the above bits just gives more detail on what
* bus/interconnect error happened. Note that bit 12 can be
* ignored, as it's the "filter" bit.
*/

return (m->status & 0xef80) == BIT(7) || ---> Memory Controller Errors
(m->status & 0xef00) == BIT(8) || ---> Cache Hierarchy Errors
(m->status & 0xeffc) == 0xc; ---> Generic Cache Hierarchy

The code tries to identify the memory and cache errors by masking the
status and then comparing based on the bit encodings below. But it seems
to be missing the "Extended Memory Errors" encoding which may have been
added after the original code was written.

Type Form
---- ----
Generic Cache Hierarchy 000F 0000 0000 11LL
TLB Errors 000F 0000 0001 TTLL
Memory Controller Errors 000F 0000 1MMM CCCC
Cache Hierarchy Errors 000F 0001 RRRR TTLL
Extended Memory Errors 000F 0010 1MMM CCCC
Bus and Interconnect Errors 000F 1PPT RRRR IILL

I am not sure what are the practical implications of getting
mce_is_memory_error() wrong. (This issue is completely theoretical right
now.) Any insights?

A couple of other points:

- The code seems ripe for a rewrite to be rid of the magic masks and bit
comparisons. I am thinking of doing that in a separate patch along side
of rewriting the comment. Would that be useful even if no issue exists?

- Relying on these bit encodings seems problematic in the long run with
the possibility of more things that could always be added. Is there a
better way to do it?

Sohil