Re: NMI errors in 2.0.30??

Richard B. Johnson (root@analogic.com)
Sat, 26 Apr 1997 21:40:44 -0400 (EDT)


The kernel doesn't "know" anything about an ECC mode in the BIOS. The
kernel presumes that all RAM found is good and whatever is written to
the RAM can be read back exactly as written.

Given that, a memory controller chip may detect a RAM parity error or,
if it handles ECC, an error that it is unable to correct. When it
detects such an error, it signals the CPU via the non-maskable
interrupt (NMI). Since the CPU cannot do anything about a RAM error
that has already occurred, different software does different things once
such an interrupt occurs. Windoze 95 issues an "unrecoverable error" message and prompts
the user to "Continue or Reboot". NT just presumes the user is dumb and
reboots. Linux knows that there isn't anything it can do about the
problem and just issues an error message and continues. MS-DOS just
ignores the problem unless a memory manager is installed. If the memory
manager is installed, it clears the screen, makes some dumb message
about "protecting you", then waits for a keypress before it reboots the
system.

In every case, there isn't really anything that the operating system
can do to "recover" from a RAM error. In some machines like VAXen,
the kernel will map out any bad RAM found. The task that was using
this RAM gets killed, but the system continues. This area of RAM
will not be reused until the system is rebooted. VAXen use 512-byte
pages.

More modern machines use 4096-byte (and larger) pages. It would be
a shame to map out an entire page just because of an occasional
RAM error in one of these pages. It could be done, however.

So what causes the RAM errors? The most common cause is bad RAM. But
there are other problems that make perfectly good RAM produce errors
that normal memory testing routines don't find. Most memory testing
consists of writing patterns to RAM and then reading them back. If
the result is what was written, the RAM is presumed good.
Unfortunately, this tests very little. Timing problems with addressing
can cause data written to a single memory location to also be written to
other memory locations! To test this, you would have to write a pattern to
ALL of RAM, then modify a single bit somewhere, then read ALL of RAM
to make sure that only that bit was modified. Then you do the next bit.
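
To illustrate the difference between the two kinds of test, here is a
minimal sketch in Python. It does not touch real hardware; it runs against
a toy RAM model. The class AliasedRAM, the "stuck address bit" fault, and
the function names are all made up for this illustration. A real timing
fault would be intermittent rather than a clean stuck bit, so take it only
as a demonstration of why write-then-read-back is not enough.

```python
class AliasedRAM:
    """Toy byte-wide RAM. Hypothetical fault model: if stuck_bit is given,
    that address bit is ignored, so two addresses share one storage cell."""

    def __init__(self, size, stuck_bit=None):
        self.size = size
        # ~0 keeps every address bit; otherwise clear the faulty bit.
        self.mask = ~0 if stuck_bit is None else ~(1 << stuck_bit)
        self.cells = bytearray(size)

    def write(self, addr, value):
        self.cells[addr & self.mask] = value & 0xFF

    def read(self, addr):
        return self.cells[addr & self.mask]


def naive_test(ram, pattern=0x55):
    """Write a pattern to each location and read it straight back."""
    for addr in range(ram.size):
        ram.write(addr, pattern)
        if ram.read(addr) != pattern:
            return False
    return True


def exhaustive_test(ram, background=0x00):
    """Fill ALL of RAM, change one bit, then read ALL of RAM to check
    that only that bit changed. Repeat for every bit."""
    for a in range(ram.size):
        ram.write(a, background)
    for addr in range(ram.size):
        for bit in range(8):
            flipped = background ^ (1 << bit)
            ram.write(addr, flipped)
            for a in range(ram.size):
                expected = flipped if a == addr else background
                if ram.read(a) != expected:
                    return False
            ram.write(addr, background)  # restore before the next bit
    return True


bad = AliasedRAM(16, stuck_bit=2)
print("naive test on faulty RAM passes:     ", naive_test(bad))
print("exhaustive test on faulty RAM passes:", exhaustive_test(bad))
```

The simple test passes on the faulty model because each location reads
back whatever was last written through the same (wrong) address. Only the
read-everything pass notices that a second location changed too.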

This would take weeks to test a few megabytes of RAM!
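
A rough back-of-the-envelope check of that figure, as a Python sketch. The
4 MB size and the 50 MB/s sustained read rate are assumptions picked for
illustration, not measurements, and the estimate counts only the
read-all-of-RAM pass after each single-bit change.

```python
def exhaustive_test_days(ram_bytes, bytes_per_second):
    """Estimated days to read all of RAM once per bit of RAM."""
    bits = ram_bytes * 8             # one full read pass per bit
    bytes_read = bits * ram_bytes    # total bytes read over all passes
    seconds = bytes_read / bytes_per_second
    return seconds / 86400           # seconds per day

# Assumed figures: 4 MB of RAM, 50 MB/s sustained read bandwidth.
days = exhaustive_test_days(4 * 1024 * 1024, 50e6)
print(f"about {days:.0f} days")
```

Under those assumptions the answer is on the order of a month, and the
time grows with the square of the RAM size.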

There is also something called pattern sensitivity. Let's say that you
read RAM in 0x1000 byte blocks (a page on the ix86). Let's say the first page
was filled with 0xffffffff and the next page was filled with 0x00000000.
Suppose that you read these two pages over and over in a continuous loop
and the loop takes 1 ms to execute. It takes a different amount of
current from the power supply to access a bunch of 0xffffffffs than
it does to access a bunch of 0x00000000s. This would put a 1 kHz load
change on the power supply. What happens if the power supply has a
1 kHz overshoot?? The voltage will bounce at a 1 kHz rate. If it gets
out of range during this modulation, RAM (any RAM anywhere) could lose
its data.

A 1000+ word paper could be written (probably has been) on testing
RAM and the causes and cures of RAM errors.

All is __NOT__ lost! Many RAM problems that didn't show up in simple
RAM test programs in PCs turn out to be caused by overheating because
the fan in the power supply is sticking. Also, dirt (lint) accumulating
in the power supply will restrict air-flow inside the computer box. This
causes problems with all the chips, not just RAM.

The temperature inside your box really should not be greater than about
45 degC. You should check this out.

Cheers,
Dick Johnson
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Richard B. Johnson
Project Engineer
Analogic Corporation
Voice : (508) 977-3000 ext. 3754
Fax : (508) 532-6097
Modem : (508) 977-6870
Ftp : ftp@boneserver.analogic.com
Email : rjohnson@analogic.com, johnson@analogic.com
Penguin : Linux version 2.1.35 on an i586 machine (66.15 BogoMips).
Warning : I read unsolicited mail for $350.00 per hour. Supply billing address.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-