Re: x86/mce: Is mce_is_memory_error() incorrect for Intel?

From: Sohil Mehta
Date: Fri Dec 15 2023 - 18:11:59 EST


Thanks Tony for the explanation. It is very helpful.

>> Type                             Form
>> ----                             ----
>> Generic Cache Hierarchy          000F 0000 0000 11LL
>> TLB Errors                       000F 0000 0001 TTLL
>> Memory Controller Errors         000F 0000 1MMM CCCC
>> Cache Hierarchy Errors           000F 0001 RRRR TTLL
>> Extended Memory Errors           000F 0010 1MMM CCCC
>> Bus and Interconnect Errors      000F 1PPT RRRR IILL
>>
>> I am not sure what the practical implications of getting
>> mce_is_memory_error() wrong are. (This issue is completely theoretical
>> right now.) Any insights?
>
> This function is used to check whether an address is OS addressable memory
> (i.e. for a page that could be taken offline). That doesn't apply to the caching
> use case (the only way to "offline" such a page would be to offline each of the
> slow memory pages that it might be used for).
>

Makes sense. I am assuming these Extended Memory Errors will not be used
anymore (even for CXL.mem type configs) and we don't need to include
them in the mce_is_memory_error() check? I'll update the comment
accordingly.
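
To make the encodings in the table above concrete, here is a minimal
user-space sketch (not the kernel code) that classifies a 16-bit MCACOD
value by the bit patterns listed there. The enum, the function name
classify_mcacod() and the mask constants are hypothetical and derived
only from the table; bit 12 (the "F" column) is masked out on the
assumption that it is the filter bit and does not affect the type.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum mcacod_type {
	MCACOD_OTHER,
	MCACOD_GENERIC_CACHE,     /* 000F 0000 0000 11LL */
	MCACOD_TLB,               /* 000F 0000 0001 TTLL */
	MCACOD_MEMORY_CONTROLLER, /* 000F 0000 1MMM CCCC */
	MCACOD_CACHE_HIERARCHY,   /* 000F 0001 RRRR TTLL */
	MCACOD_EXTENDED_MEMORY,   /* 000F 0010 1MMM CCCC */
	MCACOD_BUS_INTERCONNECT,  /* 000F 1PPT RRRR IILL */
};

/*
 * Each mask keeps the fixed bits of one row of the table and drops the
 * variable fields (LL, TT, MMM, CCCC, ...) as well as bit 12 ("F").
 */
static enum mcacod_type classify_mcacod(uint16_t code)
{
	if ((code & 0xe800) == 0x0800)
		return MCACOD_BUS_INTERCONNECT;
	if ((code & 0xef80) == 0x0280)
		return MCACOD_EXTENDED_MEMORY;
	if ((code & 0xef00) == 0x0100)
		return MCACOD_CACHE_HIERARCHY;
	if ((code & 0xef80) == 0x0080)
		return MCACOD_MEMORY_CONTROLLER;
	if ((code & 0xeff0) == 0x0010)
		return MCACOD_TLB;
	if ((code & 0xeffc) == 0x000c)
		return MCACOD_GENERIC_CACHE;
	return MCACOD_OTHER;
}
```

With this split, a check that only wants OS-addressable memory could
test for MCACOD_MEMORY_CONTROLLER alone and leave the extended memory
and cache rows out, which is the question being discussed here.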

> I'm not quite sure why bit 8 (cache hierarchy error) was added into this check.
> It would seem to have the same issues as extended memory.
>

From a little bit of digging it seems the check for "cache hierarchy
errors" was always there. Commit fa92c5869426 ("x86, mce: Support memory
error recovery for both UCNA and Deferred error in machine_check_poll")
introduced the original checks but maybe the intention at that time was
different? I see that the CEC stuff was added later so maybe the
original memory related failures were handled differently?

Now, should we remove the cache error related check from
mce_is_memory_error()?