Re: [UNTESTED PATCH] x86, mce: Avoid double entry of deferred errors into the genpool.

From: Borislav Petkov
Date: Mon Nov 23 2015 - 12:59:48 EST


On Thu, Nov 19, 2015 at 09:39:20PM +0100, Borislav Petkov wrote:
> On Thu, Nov 19, 2015 at 07:33:58PM +0000, Luck, Tony wrote:
> > > Applied, thanks.
> >
> > Did you test it (note the "UNTESTED" in the subject!). My usual system for this is getting upgrades and being
> > flaky at the moment.
>
> Bah, it builds, should be enough. Ship it. :-)
>
> Lemme get a box...

Here are some results:

# grep . /sys/kernel/debug/apei/einj/*
/sys/kernel/debug/apei/einj/available_error_type:0x00000002 Processor Uncorrectable non-fatal
/sys/kernel/debug/apei/einj/available_error_type:0x00000008 Memory Correctable
/sys/kernel/debug/apei/einj/available_error_type:0x00000010 Memory Uncorrectable non-fatal
grep: /sys/kernel/debug/apei/einj/error_inject: Permission denied
/sys/kernel/debug/apei/einj/error_type:0x0

Looks like some old EINJ without all the features. Oh well, let's see
what'll happen anyway:
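
For reference, a rough sketch of how one could list which EINJ error types this box does not offer. The bit-to-name table follows the ACPI EINJ error-type definitions as I understand them; parse_available() and the variable names are made up for illustration, and the input is just the grep output pasted from above.

```python
# Bit values follow the ACPI EINJ error-type definitions (assumed here);
# the helper names are illustrative only, not any kernel interface.
EINJ_TYPES = {
    0x001: "Processor Correctable",
    0x002: "Processor Uncorrectable non-fatal",
    0x004: "Processor Uncorrectable fatal",
    0x008: "Memory Correctable",
    0x010: "Memory Uncorrectable non-fatal",
    0x020: "Memory Uncorrectable fatal",
    0x040: "PCI Express Correctable",
    0x080: "PCI Express Uncorrectable non-fatal",
    0x100: "PCI Express Uncorrectable fatal",
    0x200: "Platform Correctable",
    0x400: "Platform Uncorrectable non-fatal",
    0x800: "Platform Uncorrectable fatal",
}


def parse_available(text):
    """Return the set of type values found in available_error_type lines."""
    found = set()
    for line in text.splitlines():
        if "available_error_type:" not in line:
            continue
        # The first token after the colon is the hex value, e.g. 0x00000002.
        value = line.split("available_error_type:", 1)[1].split()[0]
        found.add(int(value, 16))
    return found


# Pasted from the grep output in this mail.
GREP_OUTPUT = """\
/sys/kernel/debug/apei/einj/available_error_type:0x00000002 Processor Uncorrectable non-fatal
/sys/kernel/debug/apei/einj/available_error_type:0x00000008 Memory Correctable
/sys/kernel/debug/apei/einj/available_error_type:0x00000010 Memory Uncorrectable non-fatal
"""

available = parse_available(GREP_OUTPUT)
missing = [name for bit, name in sorted(EINJ_TYPES.items()) if bit not in available]
for name in missing:
    print("not offered:", name)
```

On this box that leaves only three of the twelve types, which is what "old EINJ" refers to above.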

# echo 0x8 > error_type
# echo 1 > error_inject

[ 840.461666] mce: [Hardware Error]: Machine check events logged
[ 840.476221] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 840.489214] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: 8c00004000010090
[ 840.507685] EDAC sbridge MC0: TSC 0
[ 840.515223] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 840.532477] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299322 SOCKET 0 APIC 0
[ 840.551279] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 840.563872] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8800004100800090
[ 840.581970] EDAC sbridge MC0: TSC 0
[ 840.589513] EDAC sbridge MC0: ADDR 0 EDAC sbridge MC0: MISC 4908400040004200
[ 840.606267] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299322 SOCKET 0 APIC 0
[ 841.499090] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)

So yeah, mce_notify_irq() is visible there, i.e. we did mce_log() here
which sets mce_need_notify.
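
As a sanity check on the two status values logged above, here is a minimal sketch that splits MCi_STATUS into its flag bits and code fields. The bit positions are the ones documented for IA32_MCi_STATUS in the Intel SDM; decode_status() is a hypothetical helper written for this mail, not a kernel function.

```python
# Flag bit positions as documented for IA32_MCi_STATUS in the Intel SDM.
FLAGS = [
    (63, "VAL"), (62, "OVER"), (61, "UC"), (60, "EN"),
    (59, "MISCV"), (58, "ADDRV"), (57, "PCC"),
]


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and code fields."""
    return {
        "flags": [name for bit, name in FLAGS if (status >> bit) & 1],
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mscod": (status >> 16) & 0xFFFF,            # bits 31:16
        "mcacod": status & 0xFFFF,                   # bits 15:0
    }


# The Bank 5 and Bank 8 status values from the log above.
for bank, status in ((5, 0x8C00004000010090), (8, 0x8800004100800090)):
    d = decode_status(status)
    print("Bank %d: %s count=%d err_code=%04x:%04x" % (
        bank, "|".join(d["flags"]), d["corrected_count"],
        d["mscod"], d["mcacod"]))
```

Both values have UC clear and a corrected count of 1, which fits the "1 CE" EDAC line, and Bank 8 has ADDRV clear, which fits its "ADDR 0".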

# echo 0x2 > error_type
# echo 1 > error_inject
bash: echo: write error: Invalid argument
[ 885.272000] [Firmware Warn]: APEI: Invalid action table, unknown instruction type: 5

ACPI_EINJ_FLUSH_CACHELINE??

Yeah, we're missing some functionality.

# echo 0x10 > error_type
# echo 1 > error_inject

That went BOOM:

[ 1296.233435] Disabling lock debugging due to kernel taint
[ 1296.248010] mce: [Hardware Error]: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
[ 1296.269245] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8136260f> {intel_idle+0xbf/0x130}
[ 1296.290735] mce: [Hardware Error]: TSC 37c1fb53beb ADDR bb68f400 MISC 20401a9a86
[ 1296.309772] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1448299778 SOCKET 0 APIC c microcode 710
[ 1296.332058] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 1296.346094] EDAC sbridge MC0: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
[ 1296.366517] EDAC sbridge MC0: TSC 37c1fb53beb
[ 1296.375974] EDAC sbridge MC0: ADDR bb68f400 EDAC sbridge MC0: MISC 20401a9a86
[ 1296.394493] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299778 SOCKET 0 APIC c
[ 1296.416153] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68f offset:0x400 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
...

Judging by the CPU numbers, it looks like node 0 got that error in the shared bank:

.... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
.... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39

finishing with

[ 1299.907994] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 1299.926783] Kernel panic - not syncing: Fatal machine check
[ 1299.959632] Kernel Offset: disabled
[ 1299.984254] Rebooting in 100 seconds..
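
The panic message matches the status value: be00000000010090 has both UC and PCC set. A very rough sketch of that check follows, with bit positions as in the Intel SDM (they correspond to the kernel's MCI_STATUS_UC and MCI_STATUS_PCC). rough_class() is a hypothetical helper; the kernel's actual severity grading looks at a lot more than these two bits.

```python
MCI_STATUS_UC = 1 << 61   # uncorrected error
MCI_STATUS_PCC = 1 << 57  # processor context corrupt


def rough_class(status):
    """Coarse classification only; real severity grading has many more rules."""
    if status & MCI_STATUS_PCC:
        return "processor context corrupt"
    if status & MCI_STATUS_UC:
        return "uncorrected"
    return "corrected"


# Status from the 0x10 injection, then the one from the earlier 0x8 injection.
print(rough_class(0xBE00000000010090))
print(rough_class(0x8C00004000010090))
```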

Now with dont_log_ce set:

$ for i in $(seq 0 63); do echo 1 > /sys/devices/system/machinecheck/machinecheck$i/dont_log_ce; cat /sys/devices/system/machinecheck/machinecheck$i/dont_log_ce; done | uniq
1

# echo 0x8 > error_type
# echo 1 > error_inject

[ 318.263797] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 318.277029] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: 8c00004000010090
[ 318.295631] EDAC sbridge MC0: TSC 0
[ 318.303143] EDAC sbridge MC0: ADDR bb68f000 EDAC sbridge MC0: MISC 2040262686
[ 318.320473] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448300397 SOCKET 0 APIC 0
[ 318.809112] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68f offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)

This looks ok: the mce_notify_irq() line "mce: [Hardware Error]: Machine
check events logged" is missing, which is as expected, but the EDAC
lines are still there because we sent the error on the notify chain.
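
As a side check on the EDAC lines in this mail: the page: and offset: fields are just the logged ADDR split at the page boundary. A small sketch, assuming 4 KiB pages (PAGE_SHIFT of 12); split_addr() is a made-up helper name.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def split_addr(addr):
    """Return (page frame number, offset within the page)."""
    return addr >> PAGE_SHIFT, addr & ((1 << PAGE_SHIFT) - 1)


# The three ADDR values logged in this mail.
for addr in (0xBB68EC00, 0xBB68F400, 0xBB68F000):
    page, offset = split_addr(addr)
    print("ADDR %x -> page:0x%x offset:0x%x" % (addr, page, offset))
```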

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.