RE: x86/mce: Is mce_is_memory_error() incorrect for Intel?

From: Luck, Tony
Date: Wed Dec 13 2023 - 14:56:16 EST


> The code tries to identify the memory and cache errors by masking the
> status and then comparing based on the bit encodings below. But it seems
> to be missing the "Extended Memory Errors" encoding which may have been
> added after the original code was written.

That's right. The "Extended Memory Errors" were added for systems that use
some "fast" memory as a cache for "slower" memory. Initial use case was for
3D-Xpoint (using this to add capacity rather than making use of persistence).

> Type Form
> ---- ----
> Generic Cache Hierarchy 000F 0000 0000 11LL
> TLB Errors 000F 0000 0001 TTLL
> Memory Controller Errors 000F 0000 1MMM CCCC
> Cache Hierarchy Errors 000F 0001 RRRR TTLL
> Extended Memory Errors 000F 0010 1MMM CCCC
> Bus and Interconnect Errors 000F 1PPT RRRR IILL
>
> I am not sure what are the practical implications of getting
> mce_is_memory_error() wrong. (This issue is completely theoretical right
> now.) Any insights?

This function is used to check whether an address is OS addressable memory
(i.e. for a page that could be taken offline). That doesn't apply to the caching
use case (the only way to "offline" such a page would be to offline each of the
slow memory pages that it might be used for).

I'm not quite sure why bit 8 (cache hierarchy error) was added into this check,
It would seem to have the same issues as extended memory.

> A couple of other points:
>
> - The code seems ripe for a rewrite to be rid of the magic masks and bit
> comparisons. I am thinking of doing that in a separate patch along side
> of rewriting the comment. Would that be useful even if no issue exists?

Maybe the code would be prettier, or at least easier to read, with some well
chosen names for the bits and fields. Only way to tell would be to write the
code and see how it looks.

> - Relying on these bit encodings seems problematic in the long run with
> the possibility of more things that could always be added. Is there a
> better way to do it?

Computer h/w architecture is going to evolve so change is going to be
needed however you code it.

-Tony