Re: correcting one-bit errors in the idle task

Neil Conway (nconway.list@ukaea.org.uk)
Tue, 20 Oct 1998 08:30:06 +0000


Daniel Quinlan wrote:
>
> This might make an interesting and useful small project for someone
> with a little free time.
>
> My understanding is that ECC corrects one-bit errors only when a
> memory location is read. It would be very cool to have a kernel
> thread (perhaps in the idle task) that slowly goes through physical
> memory reading every byte. That would allow one-bit errors to be
> corrected before they become two-bit errors.
>
> Cycling through memory relatively infrequently (on the order of once
> an hour, maybe even one or two times a day) would probably be often
> enough, so cache pollution wouldn't be a concern.

I've been meaning to ask questions about this... We have a number of
machines here with lots of ECC memory, some of which have scrubbing
logic in the chipset (BX) and some of which don't (LX,FX).

My first question is: are there any tools (esp. for Linux :) to ask the
chipset how *many* single-bit errors have occurred?

If the average time between errors is 1000 days then maybe it's not a
worry ;-) Of course, if it's 24 hours then one might worry a little...

Anyway, I've just reread what Daniel wrote and a little clarification is
required: *reading* the location only fixes one-bit errors on some
chipsets, not all. Writing the same data back again would fix the
error. That would be a little more dangerous and would require a
slightly more sophisticated daemon, but it's nothing we (I laughingly
include myself) kernel hackers can't handle.
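
To make the write-back idea concrete, here is a rough user-space sketch
in C, not kernel code. The "more dangerous" part is that a plain read
followed by a plain write could clobber a value someone else stored in
between, so the sketch uses an atomic add of zero to read each word and
write it back in one step. The function name scrub_region() is
hypothetical, it walks an ordinary buffer rather than physical memory,
and it is an assumption that the compiler and CPU really issue the store
for an add of zero (a real implementation would probably need
arch-specific code to be sure).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical sketch: touch every word of a buffer with an atomic
 * read-modify-write that adds zero. The stored value is unchanged, but
 * the read and the write-back happen as one atomic operation, so a
 * concurrent writer cannot slip in between them.
 *
 * Assumption: the hardware actually performs the store, which is what
 * would push the ECC-corrected data back to memory on a chipset that
 * does not scrub on read. The C standard does not promise this.
 *
 * Returns the number of words touched.
 */
size_t scrub_region(atomic_ulong *words, size_t nwords)
{
    size_t i;

    assert(words != NULL || nwords == 0);

    for (i = 0; i < nwords; i++)
        atomic_fetch_add_explicit(&words[i], 0UL, memory_order_relaxed);

    return i;
}
```

A daemon built on this would call it on small chunks with long sleeps in
between, so that a full pass takes an hour or more as Daniel suggests.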

P-Pro chipsets like the 440FX don't scrub errors, and neither does the
PII LX. The BX does, though...

Neil
E&OE :-)
