Re: [RFC PATCH 0/14] amd64_edac: marry mcheck to amd64 edac

From: Doug Thompson
Date: Mon Jul 20 2009 - 13:24:21 EST


--- On Mon, 7/20/09, Borislav Petkov <borislav.petkov@xxxxxxx> wrote:

> From: Borislav Petkov <borislav.petkov@xxxxxxx>
> Subject: [RFC PATCH 0/14] amd64_edac: marry mcheck to amd64 edac
> To: mingo@xxxxxxx, hpa@xxxxxxxxx, tglx@xxxxxxxxxxxxx, norsk5@xxxxxxxxx, aris@xxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx, x86@xxxxxxxxxx
> Date: Monday, July 20, 2009, 10:12 AM
> Hi all,
>
> this is the first version of the attempt to forward MCE information to
> the amd64 EDAC module for further decoding. When the MCE handler gets
> invoked and the EDAC module is loaded, here's how a decoded MCE looks
> like:

This looks good. I will apply and test shortly.

Question: are you planning to add the ErrAddr decoding later, where we decode the error address to an actual DIMM label, as stored in the MCI structure?

If so, okay. If not, then we must have that displayed so the maintenance techs know exactly which DIMM to pull. Only the amd64 EDAC module has that information and the controller registers needed to decode it properly.

The MCE code also has a poller thread for CORRECTED errors. Its cycle is about 5 minutes, I believe, while EDAC's is 1 second. That is another item we need to sort out.

thanks

doug t

>
> Disabling lock debugging due to kernel taint
>
> <0>HARDWARE ERROR
> CPU 3: Machine Check Exception:                4 Bank 0: b20040001c000175
> TSC 714e9b73cf
> PROCESSOR 2:100f22 TIME 1247237579 SOCKET 0 APIC 3
> MC0_STATUS: Uncorrected error, report: yes, MiscV: invalid, CPU context corrupt: yes
> Data Cache Error: Data/Tag Evict error.
> Transaction: Evict, Type: Data, Cache Level: L1
> This is not a software problem!
> <0>Run through mcelog --ascii to decode and contact your hardware vendor
> Machine check: Processor context corrupt
> Kernel panic - not syncing: Fatal machine check on current CPU
> Pid: 4817, comm: cc1 Tainted: G   M      2.6.31-rc2-00218-g78848b0-dirty #42
> Call Trace:
> <#MC>  [<ffffffff8134a17a>] panic+0xaf/0x178
> [<ffffffff812b5d9e>] ? decode_mce+0x47e/0x540
> [<ffffffff81019210>] ? print_mce+0x90/0x110
> [<ffffffff810193e7>] mce_panic+0x157/0x180
> [<ffffffff81019de7>] do_machine_check+0x757/0x930
> [<ffffffff8134d96d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [<ffffffff8134e9cb>] machine_check+0x1b/0x20
> <EOE>
>
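For reference, the MC0_STATUS and cache-error lines in the trace above can be reproduced by hand from the raw bank 0 value. The following is only a minimal sketch in C, not the amd64_edac code from this series: the names mci_decoded, mci_decode, mci_rrrr_name, mci_tt_name and mci_ll_name are hypothetical, and it assumes the architectural MCi_STATUS layout (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57) with the memory error code format 0000 0001 RRRR TTLL in the low 16 bits.

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields; not a kernel structure. */
struct mci_decoded {
    int valid, over, uc, en, miscv, addrv, pcc; /* status flag bits 63..57 */
    int is_mem_err;                             /* low 16 bits match 0000 0001 RRRR TTLL */
    unsigned rrrr, tt, ll;                      /* transaction, type, cache level */
};

/* Return bit n of the 64-bit status word as 0 or 1. */
int mci_status_bit(uint64_t status, unsigned n)
{
    return (int)((status >> n) & 1);
}

/* Split a raw MCi_STATUS value into flags and error-code fields. */
struct mci_decoded mci_decode(uint64_t status)
{
    struct mci_decoded d;
    uint16_t code = (uint16_t)(status & 0xffff);

    d.valid = mci_status_bit(status, 63);
    d.over  = mci_status_bit(status, 62);
    d.uc    = mci_status_bit(status, 61);
    d.en    = mci_status_bit(status, 60);
    d.miscv = mci_status_bit(status, 59);
    d.addrv = mci_status_bit(status, 58);
    d.pcc   = mci_status_bit(status, 57);

    d.is_mem_err = (code & 0xff00) == 0x0100;
    d.rrrr = (code >> 4) & 0xf;
    d.tt   = (code >> 2) & 0x3;
    d.ll   = code & 0x3;
    return d;
}

/* Name tables for the memory error code fields (assumed encodings). */
const char *mci_rrrr_name(unsigned rrrr)
{
    static const char *const names[] = {
        "Generic", "Generic read", "Generic write", "Data read",
        "Data write", "Instruction fetch", "Prefetch", "Evict", "Snoop"
    };
    return rrrr < 9 ? names[rrrr] : "Reserved";
}

const char *mci_tt_name(unsigned tt)
{
    static const char *const names[] = {
        "Instruction", "Data", "Generic", "Reserved"
    };
    return names[tt & 0x3];
}

const char *mci_ll_name(unsigned ll)
{
    static const char *const names[] = { "L0", "L1", "L2", "Generic" };
    return names[ll & 0x3];
}
```

With the bank 0 value from the trace, mci_decode(0xb20040001c000175ULL) gives uc=1, en=1, miscv=0, addrv=0, pcc=1 and rrrr/tt/ll of 7/1/1, i.e. Evict, Data, L1, which matches the decoded lines. ADDRV is clear in this particular record, so it carries no error address that could be mapped to a DIMM.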
> Clearly, the "Run through mcelog... " line is redundant now :) since
> there's no need for userspace decoding anymore and the original EDAC
> functionality (polling workqueue) is still preserved. The code currently
> uses EDAC to decode DRAM ECC errors but this could clearly be extended
> to handle all valid addresses acquired from MCi_ADDR registers.
>
> Comments and further suggestions are most welcome.
>
> Thanks,
> Boris.
>
> arch/x86/kernel/cpu/mcheck/mce.c    |    7 +
> drivers/edac/amd64_edac.c           |  484 +++++++++++++++++++++--------------
> drivers/edac/amd64_edac.h           |   67 ++---
> drivers/edac/amd64_edac_dbg.c       |    2 +-
> drivers/edac/amd64_edac_err_types.c |  126 +++++-----
> 5 files changed, 382 insertions(+), 304 deletions(-)
>
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>