Re: [2.6.1 MCE falseness?] Hardware reports non-fatal error

From: glee
Date: Sun Jan 18 2004 - 09:25:30 EST


On Sun, Jan 18, 2004 at 02:30:48PM +0100, Pedro Larroy wrote:
> I also have been getting apparently false MCEs since 2.5.xx
> I even had kernel panics in early 2.5 with MCE enabled. Now in 2.6.0-xx
> and in 2.6.1 I just get them from time to time but none fatal.
> most of the time in CPU 0
>


I get them too, so I applied this patch.


- g.

--- linux-2.6.0-test9/arch/i386/kernel/cpu/mcheck/non-fatal.c.orig 2003-11-02 13:31:43.000000000 +0800
+++ linux-2.6.0-test9/arch/i386/kernel/cpu/mcheck/non-fatal.c 2003-11-02 21:50:36.000000000 +0800
@@ -21,6 +21,7 @@

static struct timer_list mce_timer;
static int timerset;
+static int startbank;

#define MCE_RATE 15*HZ /* timer rate is 15s */

@@ -30,7 +31,7 @@
int i;

preempt_disable();
- for (i=0; i<nr_mce_banks; i++) {
+ for (i=startbank; i<nr_mce_banks; i++) {
rdmsr (MSR_IA32_MC0_STATUS+i*4, low, high);

if (high & (1<<31)) {
@@ -68,6 +69,19 @@

static int __init init_nonfatal_mce_checker(void)
{
+ /*
+ Certain Athlons would cause spurious MCE non-fatal
+ exception check to be reported, if we poke at bank 0.
+ Avoid this if the running machine is an Athlon.
+
+ -- geoff.
+ */
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
+ boot_cpu_data.x86 == 6)
+ startbank = 1;
+ else
+ startbank = 0;
+
if (timerset == 0) {
/* Set the timer to check for non-fatal
errors every MCE_RATE seconds */