Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

From: Matthias Fouquet-Lapar
Date: Thu Jan 29 2004 - 03:37:44 EST


> Andi> e.g. one bit ECC errors in memory are quite common. And with
> Andi> ECC memory they are not really fatal.
>
> Yet they are a good indicator that something is wrong (not performing
> properly) or may be failing soon. I don't think putting on blinders
> for such problems is a good idea. Though I agree that the question of
> how to report such things without needlessly alerting Joe Clueless is
> an interesting challenge.

We have done a rather large study of DIMMs that had single-bit errors
(SBEs) and found no evidence that an SBE turns into an uncorrectable
error (UCE), i.e. the fact that an SBE is reported is no indication that
the device might fail soon.

As a matter of fact, the soft error rate increases as parts move to
smaller process technologies and lower supply voltages. Cosmic rays
are one source of soft errors. Another is alpha particles emitted by
the solder.

Still, I think it's important to log SBEs, but you will probably need
a threshold in case you hit a hard SBE, since it can recur on every read
of that location. Also, scrubbing the memory location (and re-reading it
to check whether the error was transient) might be a good idea if the
memory controller supports this. If it is a true, hard SBE it should be
reported. It might also be a good idea to mark the page so that it does
not get re-allocated.
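
To make the flow above concrete, here is a minimal user-space sketch in C.
It is a simulation, not kernel code: struct sim_page, page_read(),
handle_sbe(), sbe_should_log(), NUM_PAGES and SBE_THRESHOLD are all
hypothetical names invented for illustration, and a stuck-at-1 bit mask
stands in for a hard fault. Applying the threshold per page is also an
assumption; a real handler would depend on what the memory controller
actually reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PAGES     4  /* hypothetical: size of the simulated memory */
#define SBE_THRESHOLD 3  /* hypothetical: max SBE log entries per page */

enum sbe_kind { SBE_TRANSIENT, SBE_HARD };

/* One simulated page, reduced to a single 64-bit word. */
struct sim_page {
    uint64_t raw;         /* value last written to the cell */
    uint64_t stuck_mask;  /* bits stuck at 1, modelling a hard fault */
    unsigned sbe_count;   /* transient SBEs seen so far on this page */
    bool     retired;     /* set once the page must not be re-allocated */
};

static struct sim_page pages[NUM_PAGES];

/* A read returns the stored value with any stuck bits forced to 1. */
static uint64_t page_read(const struct sim_page *p)
{
    return p->raw | p->stuck_mask;
}

/*
 * Called after ECC has corrected a single-bit error on page idx;
 * 'corrected' is the value ECC reconstructed.
 */
static enum sbe_kind handle_sbe(unsigned idx, uint64_t corrected)
{
    assert(idx < NUM_PAGES);
    struct sim_page *p = &pages[idx];

    p->raw = corrected;                /* scrub: write good data back */
    if (page_read(p) == corrected) {   /* re-read: did the error clear? */
        p->sbe_count++;                /* transient, just count it */
        return SBE_TRANSIENT;
    }

    p->retired = true;                 /* hard SBE: keep page out of use */
    return SBE_HARD;
}

/* Rate limit: log transient SBEs only up to the per-page threshold. */
static bool sbe_should_log(unsigned idx)
{
    assert(idx < NUM_PAGES);
    return pages[idx].sbe_count <= SBE_THRESHOLD;
}
```

In this sketch a caller would log the event when handle_sbe() returns
SBE_TRANSIENT and sbe_should_log() is still true, and would always report
an SBE_HARD result, since that one is returned only when the page has been
retired.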


Thanks

Matthias Fouquet-Lapar Core Platform Software mfl@xxxxxxx VNET 521-8213
Principal Engineer Silicon Graphics Home Office (+33) 1 3047 4127
