Re: [PATCH] x86/mce/amd: init mce severity to handle deferred memory failure

From: Yazen Ghannam
Date: Tue May 09 2023 - 10:26:03 EST


On 4/25/23 8:18 AM, Shuai Xue wrote:
> When a deferred UE error is detected, e.g by background patrol scruber, it
> will be handled in APIC interrupt handler amd_deferred_error_interrupt().
> The handler will collect MCA banks, init mce struct and process it by
> nofitying the registered MCE decode chain.
>
> The uc_decode_notifier, one of MCE decode chain, will process memory
> failure but only limit to MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
> However, APIC interrupt handler does not init mce severity and the
> uninitialized severity is 0 (MCE_NO_SEVERITY).
>
> To handle the deferred memory failure case, init mce severity when logging
> MCA banks.
>
> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
>

Hi Shuai Xue,

I think this patch is fair to do. But it won't have the intended effect
in practice.

The value in MCA_ADDR for DRAM ECC errors will be a memory controller
"normalized address". This is not a system physical address that the OS
can use to take action.

The mce_usable_address() function needs to be updated to handle this.
I'll send a patchset this week to do so. Afterwards, the
uc_decode_notifier will not attempt to handle these errors.

Thanks,
Yazen