Re: NMI errors in 2.0.30??

Rogier Wolff (R.E.Wolff@BitWizard.nl)
Sun, 27 Apr 1997 14:43:29 +0200 (MET DST)


Richard B. Johnson wrote:
>
> The kernel doesn't "know" anything about an ECC mode in the BIOS. The
> kernel presumes that all RAM found is good and whatever is written to
> the RAM can be read back exactly as written.
>
> Given that, a memory controller chip may detect a RAM parity error or
> the inability to correct a RAM error if it handles ECC, i.e., correctable
> errors. When it detects such an error, it signals the CPU via the non-
> maskable interrupt. Since the CPU can not do anything about a RAM error
> that has occurred, software can do different things once such an interrupt
> occurs. Windoze 95 issues an "inrecoverable error" message and prompts
> the user to "Continue or Reboot". NT just presumes the user is dumb and
> reboots. Linux knows that there isn't anything it can do about the
> problem and just issues an error message and continues. MS-DOS just
> ignores the problem unless a memory manager is installed. If the memory
> manager is installed, it clears the screen, makes some dumb message
> about "protecting you", then waits for a keypress before it reboots the
> system.
>
> In every case, there isn't really anything that the operating system
> can do to "recover" from a RAM error. In some machines like VAXen,
> the kernel will map out any bad RAM found. The task that was using
> this RAM gets killed, but the system continues. This area of RAM
> will not be reused until the system is rebooted. VAXen use 512-byte
> pages.
>
> More modern machines use 4096-byte (and larger) pages. It would be
> a shame to map out an entire page just because of an occasional
> RAM error in one of these pages. It could be done, however.
>
> So what causes the RAM errors? The most common cause is bad RAM. But
> there are other problems that make perfectly good RAM produce errors
> that normal memory testing routines don't find. Most memory testing
> consists of writing patterns to RAM and then reading it back. If
> the result is what was written, the RAM is presumed good.
> Unfortunately, this tests very little. Timing problems with addressing
> can cause data written to a single memory location to also be written to
> other memory locations! To test this, you would have to write a pattern to
> ALL of RAM, then modify a single bit somewhere, then read ALL of RAM
> to make sure that only that bit was modified. Then you do the next bit.

Nice story so far, but now you're claiming incorrect things.

There is a set of algorithms called "marching tests". The
simplest is "mats", commonly written as {up_W0, up_R0W1, updn_R1}.
This test detects the simplest types of errors, and not much more.
It does not find all the coupling errors that you describe.
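
To make the notation concrete, here is a minimal sketch of mats in C.
The function name, the word-wide all-zeros/all-ones data, and the return
convention are assumptions for illustration only. Run from user space it
mostly exercises the cache, not the RAM chips.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: run {up_W0, up_R0W1, updn_R1} over n words.
 * Returns the index of the first failing word, or n if none was seen. */
size_t mats(volatile uint32_t *mem, size_t n)
{
    const uint32_t zero = 0x00000000u, one = 0xffffffffu;
    size_t i;

    for (i = 0; i < n; i++)          /* up_W0: write 0 everywhere */
        mem[i] = zero;

    for (i = 0; i < n; i++) {        /* up_R0W1: expect 0, then write 1 */
        if (mem[i] != zero)
            return i;
        mem[i] = one;
    }

    for (i = 0; i < n; i++)          /* updn_R1: direction does not matter */
        if (mem[i] != one)
            return i;

    return n;
}
```

Each cell is touched only a handful of times, so the run time is linear
in the memory size.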

A more sophisticated test like MarchG finds all errors up to and
including all the single coupling errors that may exist.

MarchG is (cut-and-paste from a memory test program that I'm writing):

MARCH(UPDN,W0, TT,order,stride); \
MARCH(UP ,R0W1R1W0R0W1,TT,order,stride); \
MARCH(UP ,R1W0W1, TT,order,stride); \
MARCH(DOWN,R1W0W1W0, TT,order,stride); \
MARCH(DOWN,R0W1W0, TT,order,stride); \
DELAY; \
MARCH(UPDN,R0W1R1, TT,order,stride); \
DELAY; \
MARCH(UPDN,R1W0R0, TT,order,stride);

This is supposed to also find errors caused by the refresh not working
for some memory cells. That is what the "DELAY" things are for.
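
For readers without the MARCH macro, the same sequence can be sketched
as a small interpreter in C. This is not the code from my program: the
names march() and march_g(), the string encoding of the operations, and
the delay callback are all made up for illustration, and the TT, order
and stride parameters are left out. It uses whole bytes of 0x00/0xff as
the "0" and "1" values, and runs the UPDN elements upward.

```c
#include <stddef.h>
#include <stdint.h>

enum dir { UP, DOWN };

/* Apply one march element (e.g. "R0W1") to every cell, in address
 * order d. Returns the number of reads that did not match. */
static size_t march(volatile uint8_t *mem, size_t n, enum dir d,
                    const char *ops)
{
    size_t errors = 0;

    for (size_t k = 0; k < n; k++) {
        size_t i = (d == UP) ? k : n - 1 - k;

        for (const char *p = ops; p[0] && p[1]; p += 2) {
            uint8_t v = (p[1] == '1') ? 0xff : 0x00;

            if (p[0] == 'W')
                mem[i] = v;          /* write 0 or 1 */
            else if (mem[i] != v)    /* 'R': read and compare */
                errors++;
        }
    }
    return errors;
}

/* Run the MarchG sequence; delay may be NULL to skip the pauses. */
size_t march_g(volatile uint8_t *mem, size_t n, void (*delay)(void))
{
    size_t e = 0;

    e += march(mem, n, UP,   "W0");
    e += march(mem, n, UP,   "R0W1R1W0R0W1");
    e += march(mem, n, UP,   "R1W0W1");
    e += march(mem, n, DOWN, "R1W0W1W0");
    e += march(mem, n, DOWN, "R0W1W0");
    if (delay) delay();              /* let weak cells lose their charge */
    e += march(mem, n, UP,   "R0W1R1");
    if (delay) delay();
    e += march(mem, n, UP,   "R1W0R0");
    return e;
}
```

On good memory march_g() returns 0. The delay has to be long compared
with the refresh interval for the retention part to mean anything.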

> This would take weeks to test a few megabytes of RAM!
>
> There is also something called pattern sensitivity. Let's say that you
> read RAM in 0x1000 byte blocks (a page on the ix86). Let's say the first page
> was filled with 0xffffffff and the next page was filled with 0x00000000.
> Suppose that you read these two pages over and over in a continuous loop
> and the loop takes 1 ms to execute. It takes a different amount of
> current from the power supply to access a bunch of 0xffffffffs than
> it does to access a bunch of 0x00000000s. This would put a 1 kHz load
> change on the power supply. What happens if the power supply has a
> 1 kHz overshoot?? The voltage will bounce at a 1 kHz rate. If it gets
> out of range during this modulation, RAM (any RAM anywhere) could lose
> its data.

Hey, this is indeed a possible scenario of why memory tests cannot
find some errors, while kernel compiles do.
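
The access pattern in that scenario could be sketched roughly like this
in C. The function name and parameters are hypothetical, and the sketch
only checks the two pages it reads, whereas the scenario says any RAM
anywhere could be hit. With caches enabled the reads may never reach the
RAM at all, so treat it as an illustration of the loop, not as a working
detector.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS (0x1000 / sizeof(uint32_t))   /* one ix86 page */

/* Fill one page with all ones and one with all zeros, then read the
 * two alternately for the given number of rounds. Returns the number
 * of words that read back wrong. */
size_t alternate_pages(volatile uint32_t *ones, volatile uint32_t *zeros,
                       size_t rounds)
{
    size_t bad = 0;
    size_t i, r;

    for (i = 0; i < PAGE_WORDS; i++) {
        ones[i]  = 0xffffffffu;
        zeros[i] = 0x00000000u;
    }

    for (r = 0; r < rounds; r++) {
        for (i = 0; i < PAGE_WORDS; i++)   /* heavy-current page */
            if (ones[i] != 0xffffffffu)
                bad++;
        for (i = 0; i < PAGE_WORDS; i++)   /* light-current page */
            if (zeros[i] != 0x00000000u)
                bad++;
    }
    return bad;
}
```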

> A 1000+ word paper could be written (probably has been) on testing
> RAM and the causes and cures of RAM errors.

There is a 300+ page book called "Testing Semiconductor Memories" by
A.J. van de Goor that explains this reasonably well....

Roger.