Re: [PROPOSAL] Coping with random bit errors

Dean Gaudet (dgaudet-list-linux-kernel@arctic.org)
Fri, 10 Oct 1997 18:09:16 -0700 (PDT)


Even when you have ECC, it's worthwhile for an idle thread to go through
RAM doing a read and writeback of every byte. This allows single-bit
errors to be corrected before they become two-bit errors.
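
For illustration only, a minimal sketch of such a scrub pass (a
userspace-style loop over a buffer standing in for a region of RAM; the
name scrub_region is made up here, and a real scrubber would be an idle
kernel thread walking physical pages rather than anything like this):

#include <stddef.h>

/*
 * Sketch of a scrub pass: touch every word with a read followed by a
 * writeback.  On an ECC system the read lets the memory controller
 * detect and correct a single-bit error, and the writeback stores the
 * corrected value before a second bit in the same word can flip.
 * The volatile qualifier keeps the compiler from optimising the
 * apparently redundant store away.
 */
static void scrub_region(volatile unsigned long *mem, size_t words)
{
	size_t i;

	for (i = 0; i < words; i++)
		mem[i] = mem[i];	/* read, then write the value back */
}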

-- 
Dean Gaudet, Performance Analyst, Transmeta Corp.

On Sat, 11 Oct 1997, Richard Gooch wrote:

> Hi, all. Over the last couple of weeks, I've been running 2.1.57 and
> have noticed random bit errors in files kept in the page cache. I
> usually find them when compiling my code (I have a large code tree
> which is actively developed), although the bit errors are in files I
> don't even change. The way I solve this is to run a simple W/R memory
> tester which causes the page cache to be (mostly) flushed. Then I can
> recompile and everything works fine again.
> I'm now dropping back to 2.1.42 for a while to see if the problem
> recurs, but I suspect the problem is hardware (alpha particles
> flipping bits in RAM): I don't have parity RAM (couldn't afford it).
> 
> The question I have is whether it would be possible (reasonable?) to
> implement a daemon (possibly a kernel daemon) which maintains a
> checksum hash of each page in the page cache which has not been
> dirtied. The daemon would periodically (only when the system is idle)
> regenerate checksums for pages and compare them with its internal
> database. If an undirtied page has a different checksum, the page is
> declared corrupted: a log message is generated (for later statistics
> collection) and the page is marked invalid/free. This will force the
> kernel to reload the page from disc next time it is required.
> Obviously, when a page is dirtied the daemon would have to remove it
> from its database. It might also be an advantage to prevent the normal
> modification of the last-used time for the page when the daemon
> accesses it.
> 
> While the above scheme is not as robust as proper ECC memory, it has
> the distinct advantage of being cheap (free:-) and should provide some
> level of protection against random bit errors. I shudder to think what
> other bit errors have crept into my source tree which don't prevent
> compiling :-(
> Anyway, I'd like to get some reaction from those who know more about
> the page cache implementation as to what they think of this idea.
> 
> Regards,
> 
> Richard....
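
A rough sketch of the checksum idea, for the sake of discussion. The
PAGE_SIZE of 4096, the toy rotate-and-xor checksum, and the names
page_record, page_checksum and verify_page are all placeholders for
illustration, not anything in the existing page cache code:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* Simple 32-bit checksum over one page; any fast hash would do. */
static uint32_t page_checksum(const unsigned char *page)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < PAGE_SIZE; i++)
		sum = ((sum << 1) | (sum >> 31)) ^ page[i];
	return sum;
}

/*
 * One entry per clean (undirtied) page-cache page.  The daemon would
 * record the checksum when the page becomes clean, drop the entry as
 * soon as the page is dirtied, and re-verify entries only while the
 * system is idle.
 */
struct page_record {
	unsigned char *page;	/* the cached page data */
	uint32_t sum;		/* checksum taken while the page was clean */
};

/* Returns 0 if the page still matches, -1 if it was silently corrupted. */
static int verify_page(const struct page_record *rec)
{
	if (page_checksum(rec->page) == rec->sum)
		return 0;

	/*
	 * Corrupted: log it for later statistics and let the caller
	 * invalidate/free the page so the kernel reloads it from disc
	 * on the next access.
	 */
	fprintf(stderr, "page cache corruption detected at %p\n",
		(void *)rec->page);
	return -1;
}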