Re: [PATCH] x86/mce/amd: init mce severity to handle deferred memory failure

From: Yazen Ghannam
Date: Wed May 10 2023 - 09:59:58 EST


On 5/9/23 10:17 PM, Shuai Xue wrote:
>
>
> On 2023/5/9 22:25, Yazen Ghannam wrote:
>> On 4/25/23 8:18 AM, Shuai Xue wrote:
>>> When a deferred UE error is detected, e.g by background patrol scruber, it
>>> will be handled in APIC interrupt handler amd_deferred_error_interrupt().
>>> The handler will collect MCA banks, init mce struct and process it by
>>> nofitying the registered MCE decode chain.
>>>
>>> The uc_decode_notifier, one of MCE decode chain, will process memory
>>> failure but only limit to MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
>>> However, APIC interrupt handler does not init mce severity and the
>>> uninitialized severity is 0 (MCE_NO_SEVERITY).
>>>
>>> To handle the deferred memory failure case, init mce severity when logging
>>> MCA banks.
>>>
>>> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
>>>
>>
>> Hi Shuai Xue,
>>
>> I think this patch is fair to do. But it won't have the intended effect
>> in practice.
>>
>> The value in MCA_ADDR for DRAM ECC errors will be a memory controller
>> "normalized address". This is not a system physical address that the OS
>> can use to take action.
>>
>> The mce_usable_address() function needs to be updated to handle this.
>> I'll send a patchset this week to do so. Afterwards, the
>> uc_decode_notifier will not attempt to handle these errors.
>
> From the experience of other platforms (e.g. ARM64 RAS and Intel MCA),
> uc_decode_notifier should handle these error to hard offline the corrupted
> page. If the corrupted page is a free buddy page, we can isolate it and avoid
> using the page in the future.
>
> In my test case, the error is detected by patrol scrubber in memory controller.
> The scrubber may lack of system address space perspective, and only reports
> "normalized address". But we can decode the "normalized address" to system address
> by EDAC (umc_normaddr_to_sysaddr), right?
>
> (I am not quite familiar with AMD RAS, please correct me if I am wrong)
>

Yes, that's correct.

The address translation requires some updates that are still in-review.
Afterwards, we can investigate ways to use the translated address. It
may require some rework in the MCE notifier chain or, more simply,
calling memory_failure() from the EDAC module itself.

Thanks,
Yazen