Re: Linux 2.6.2, AMD kernel: MCE: The hardware reports a non fatal, correctable incident

From: Joseph Fannin
Date: Wed Mar 03 2004 - 23:07:23 EST


On Tue, Mar 02, 2004 at 09:55:54PM +0000, Dave Jones wrote:
> On Tue, Mar 02, 2004 at 07:00:16PM +0100, Davi Leal wrote:
>> What about this message? Note that the system works; I have not had to
>> reboot. What does the message below mean?
>>
>
> The original plan behind that option was to find hardware faults early,
> but it seems to trigger a lot of false positives for various reasons.
> Part of this problem is that MCEs can also be generated on some hardware
> by doing something silly like reading from a reserved part of your
> motherboard chipset.

The MCE stuff truly did find a hardware fault early for me; my
Athlon system was MCE'ing and I ignored it, and later I got sig11
errors and fs corruption, which I finally traced to a failing stick
of memory.

> There are also CPU errata that can cause them to falsely trigger in
> some unusual cases, but I've not had time to go through the various
> errata datasheets to blacklist affected CPUs, unfortunately.
>
> I'm toying with the idea of marking it CONFIG_BROKEN for 2.6,
> and fixing it up later.

I wouldn't be so quick to write off MCEs as bugs or errata,
especially if the exceptions have only just begun showing up.
Running CPUBurn, memtest86 and the like is still probably a good
idea, especially if you value the data on your file system.

--
Joseph Fannin
jhf@xxxxxxxxxxxxxx

"Anyone who quotes me in their sig is an idiot." -- Rusty Russell.
