Re: Huge uptimes & cosmic rays

Daniel Quinlan (quinlan@transmeta.com)
10 Jul 1997 00:31:47 -0700


Pavel Machek <pavel@Elf.mj.gts.cz> writes:

> Huge amount of memory & huge uptime is calling for trouble: remember
> that once or twice a year, random bit in your machine is selected &
> toggled. If it hits kernel or libc or long-lived daemon...

Good reason to use ECC DRAM. I have seen computations go wrong because
of one-off bit errors. In one case, a machine had ECC accidentally
disabled. Multiple runs of an intensive computation produced one-off bit
errors in different places, and they disappeared once ECC was turned on.

I love that story... anyway, with ECC, the probability of these random
bit errors going uncorrected is significantly lower.

Finally, the probability of a memory error hitting the kernel or libc is
not much different on machines with huge amounts of RAM versus smaller
amounts of RAM. Relative to the total amount of RAM, the kernel takes
the same amount of RAM either way.
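
To put rough numbers on that, here is a minimal sketch under one reading
of the argument: bit flips land uniformly across memory at a constant
rate per megabyte, and the kernel's footprint is a fixed size. The rate
and the sizes are made-up figures for illustration, not measurements,
and the function name is hypothetical.

```python
def expected_flips_per_year(region_mb, flips_per_mb_per_year):
    """Expected bit flips per year landing in a region of region_mb megabytes.

    Assumes flips are spread uniformly over memory at a constant rate.
    """
    return region_mb * flips_per_mb_per_year


# Hypothetical numbers, chosen only to make the arithmetic concrete.
RATE = 1.5 / 64    # assumed: about 1.5 flips per year on a 64 MB machine
KERNEL_MB = 2      # assumed footprint of the kernel plus libc

for total_mb in (16, 64, 512):
    whole = expected_flips_per_year(total_mb, RATE)
    kernel = expected_flips_per_year(KERNEL_MB, RATE)
    print(f"{total_mb:4d} MB RAM: {whole:7.3f} flips/yr in total, "
          f"{kernel:.3f} in the kernel")
```

Under these assumptions the total number of flips grows with the amount
of RAM, but the number expected to land in the kernel depends only on
the kernel's own size.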

-- 
Daniel Quinlan (at work)        Linux, our last best hope for Unix
quinlan@transmeta.com           http://www.pathname.com/~quinlan/