[PATCH] x86/mce/amd: init mce severity to handle deferred memory failure

From: Shuai Xue
Date: Tue Apr 25 2023 - 08:18:54 EST


When a deferred UE error is detected, e.g by background patrol scruber, it
will be handled in APIC interrupt handler amd_deferred_error_interrupt().
The handler will collect MCA banks, init mce struct and process it by
nofitying the registered MCE decode chain.

The uc_decode_notifier, one of MCE decode chain, will process memory
failure but only limit to MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
However, APIC interrupt handler does not init mce severity and the
uninitialized severity is 0 (MCE_NO_SEVERITY).

To handle the deferred memory failure case, init mce severity when logging
MCA banks.

Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>

---
Steps to reproduce:

step 1: inject a patrol scrub error by ras-tools
#einj_mem_uc patrol

step 2: check dmesg, no memory failure log
#dmesg -c
[51295.686806] mce: [Hardware Error]: Machine check events logged
[51295.693566] mce->status: 0x942031000400011b
[51295.698248] mce->misc: 0x00000000
[51295.701952] mce->severity: 0x00000000 # Manually added printk
[51295.726640] [Hardware Error]: Deferred error, no action required.
[51295.733448] [Hardware Error]: CPU:65 (19:11:1) MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0x942031000400011b
[51295.733452] [Hardware Error]: Error Addr: 0x0000000006350a00
[51295.733453] [Hardware Error]: PPIN: 0x02b69e294c148024
[51295.733453] [Hardware Error]: IPID: 0x0000109600250f00, Syndrome: 0x9a4a00000b800000
[51295.733455] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[51295.733463] mce: umc_normaddr_to_sysaddr: Invalid DramBaseAddress range: 0x0.
[51295.733471] EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#0channel#2 (csrow:0 channel:2 page:0x0 offset:0x0 grain:64)
[51295.733471] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

After this fix:

[ 514.966892] mce: [Hardware Error]: Machine check events logged
[ 514.966912] mce->status: 0x942031000400011b
[ 514.978093] mce->misc: 0x00000000
[ 514.981796] mce->severity: 0x00000001
[ 514.985885] <uc_decode_notifier> pre_handler: p->addr = 0x00000000e09e69e4, ip = ffffffff8104b955, flags = 0x282
[ 514.997253] <uc_decode_notifier> post_handler: p->addr = 0x00000000e09e69e4, flags = 0x282
[ 515.006501] Memory failure: 0x5dc2: recovery action for free buddy page: Recovered
[ 515.015188] [Hardware Error]: Deferred error, no action required.
[ 515.022006] [Hardware Error]: CPU:67 (19:11:1) MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0x942031000400011b
[ 515.034440] [Hardware Error]: Error Addr: 0x0000000005dc2a00
[ 515.034442] [Hardware Error]: PPIN: 0x02b69e294c148024
[ 515.034443] [Hardware Error]: IPID: 0x0000109600650f00, Syndrome: 0x9a4a00000b800008
[ 515.034445] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 515.034453] umc_normaddr_to_sysaddr: Invalid DramBaseAddress range: 0x0.
[ 515.034458] EDAC MC1: 1 UE Cannot decode normalized address on mc#1csrow#0channel#6 (csrow:0 channel:6 page:0x0 offset:0x0 grain:64)
[ 515.034461] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Note, the memory_failure handles wrong physical address because
umc_normaddr_to_sysaddr fails. I don't figure out why it fails.
---
arch/x86/kernel/cpu/mce/amd.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 23c5072fbbb7..b5e1a27b0881 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -734,6 +734,7 @@ static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
m.misc = misc;
m.bank = bank;
m.tsc = rdtsc();
+ m.severity = mce_severity(&m, NULL, NULL, false);

if (m.status & MCI_STATUS_ADDRV) {
m.addr = addr;
--
2.20.1.12.g72788fdb