Re: NMI errors in 2.0.30??

Richard Gooch (rgooch@atnf.CSIRO.AU)
Sun, 27 Apr 1997 21:05:24 +1000


Richard B. Johnson writes:
[...]
> So what causes the RAM errors? The most common cause is bad RAM. But
> there are other problems that make perfectly good RAM produce errors
> that normal memory testing routines don't find. Most memory testing
> consists of writing patterns to RAM and then reading it back. If
> the result is what was written, the RAM is presumed good.
> Unfortunately, this tests very little. Timing problems with addressing
> can cause data written to a single memory location to also be written to
> other memory locations! To test this, you would have to write a pattern to
> ALL of RAM, then modify a single bit somewhere, then read ALL of RAM
> to make sure that only that bit was modified. Then you do the next bit.
>
> This would take weeks to test a few megabytes of RAM!
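
To put a rough number on that, here is a minimal sketch of the exhaustive
test described above, meant for a tiny buffer only. The function name
`exhaustive_test` and the 0x55 fill byte are illustrative choices, not
taken from any existing tester. It flips one bit at a time and re-reads
the whole buffer after each flip, returning the number of bytes it had to
read (or 0 if some other location changed).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: flip each bit of buf in turn and confirm that no
   other location changed. Returns the number of bytes read while
   verifying, or 0 if an unexpected change was seen. */
unsigned long long exhaustive_test (volatile unsigned char *buf,
                                    size_t num_bytes)
{
    unsigned long long bytes_read = 0;
    size_t bit, i;

    assert (num_bytes > 0);
    for (i = 0; i < num_bytes; ++i) buf[i] = 0x55;
    for (bit = 0; bit < num_bytes * 8; ++bit)
    {
        size_t target = bit / 8;
        unsigned char mask = (unsigned char) (1u << (bit % 8));

        buf[target] ^= mask;              /* Flip exactly one bit    */
        for (i = 0; i < num_bytes; ++i)   /* Re-read the whole buffer */
        {
            unsigned char expected = (i == target) ? (0x55 ^ mask) : 0x55;

            ++bytes_read;
            if (buf[i] != expected) return 0;
        }
        buf[target] ^= mask;              /* Restore before next bit */
    }
    return bytes_read;
}
```

The work grows with the square of the buffer size: a 64 byte buffer needs
64 * 8 * 64 = 32768 byte reads, while 4 MBytes needs 2^47 (about 1.4e14).
At an assumed 50 MBytes/s of read throughput (a guess, not a measurement)
that is roughly a month.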

A few years back I wrote a memory tester because of some bad RAM. I
bought the machine in February (our summer) and all was fine. As
winter approached, I would come in of a morning and find that some
programmes didn't work properly. After rebooting, all was fine, until
next morning. Leaving the air conditioner on heat overnight made the
problem go away.
The memory tester showed that a few bits would lose their correct
state in the early hours of the morning. What it did was write a
pattern to all of available memory, and periodically (say every hour)
check that pattern. The pattern was derived from the index of each
word within the test buffer (in effect its virtual memory offset),
XORed with a fixed word. This scheme, while not
completely robust against the situation you outline above, does help
to a reasonable extent. It has the benefit of being able to make a
single pass through memory very quickly. Anyway, I've appended the
code for said memory tester below. No manual page, but if enough
people ask for it I could be persuaded to write one. Even better if
someone writes one for me:-)
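
As a rough illustration of why a position-dependent pattern helps with
addressing faults, the sketch below simulates a small memory in which one
address bit is stuck at zero, so that pairs of locations alias. The fault
model and the names `faulty_index`, `count_mismatches` and `SIM_WORDS` are
assumptions made up for this example. Only the pattern itself
(`index ^ xor_pattern`) mirrors what mem_test.c does.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_WORDS 64  /* Size of the simulated memory, in words */

/* Hypothetical fault model: address bit stuck_bit is stuck at zero, so
   two different indices land on the same physical word. */
static size_t faulty_index (size_t index, unsigned int stuck_bit)
{
    return index & ~( (size_t) 1 << stuck_bit );
}

/* Write a pattern through the faulty address lines, read it back the
   same way, and return the number of mismatching words. If use_index
   is zero, every word gets the same constant pattern. */
unsigned int count_mismatches (unsigned int stuck_bit, int use_index)
{
    unsigned long mem[SIM_WORDS] = {0};
    unsigned long xor_pattern = 0xab12cf9eUL;
    unsigned int mismatches = 0;
    size_t i;

    assert (stuck_bit < 6);  /* 2^6 == SIM_WORDS */
    for (i = 0; i < SIM_WORDS; ++i)
    {
        mem[faulty_index (i, stuck_bit)] =
            use_index ? ( (unsigned long) i ^ xor_pattern ) : xor_pattern;
    }
    for (i = 0; i < SIM_WORDS; ++i)
    {
        unsigned long expected =
            use_index ? ( (unsigned long) i ^ xor_pattern ) : xor_pattern;

        if (mem[faulty_index (i, stuck_bit)] != expected) ++mismatches;
    }
    return mismatches;
}
```

With bit 3 stuck, `count_mismatches (3, 0)` reports no errors because every
aliased word holds the same constant, whereas `count_mismatches (3, 1)`
reports 32 bad words out of 64: half of the locations were overwritten by
their alias. This simulated fault is only one of the ways addressing can go
wrong, so it shows the idea rather than proving coverage.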

[...]
> All is __NOT__ lost! Many RAM problems that didn't show up in simple
> RAM test programs in PCs turn out to be caused by over temperature caused
> by the fan in the power supply sticking. Also dirt (lint) accumulating
> in the power supply will prevent air-flow inside the computer box. This
> causes problems with all the chips, not just RAM.

Late last year I switched to a style of case which features an
additional fan at the front, sucking air in through a dust filter. I'm
interested in seeing how grotty the insides look after a year or so...

Regards,

Richard....

================mem_test.c=====================================================
/* mem_test.c

Source file for mem_test (find bad memory).

Copyright (C) 1995-1997 Richard Gooch

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Richard Gooch may be reached by email at rgooch@atnf.csiro.au
The postal address is:
Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
*/

/* This programme will find bad memory.

Written by Richard Gooch 2-JUL-1995

Last updated by Richard Gooch 27-APR-1997

*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MBYTE 1048576
#define WORDS_PER_MBYTE (MBYTE / sizeof (unsigned long))
#define TRUE 1
#define FALSE 0

static void thrash_test (unsigned int num_words, unsigned long *array);
static void idle_test (unsigned int num_words, unsigned long *array,
int interval);

int main (int argc, char **argv)
{
    int idle = FALSE;
    int interval = 0;
    unsigned int num_mbytes;
    unsigned int num_words;
    unsigned long *array;
    static char usage_string[] = "Usage:\tmem_test [-idle interval] #MBytes";

    if ( (argc != 2) && (argc != 4) )
    {
        fprintf (stderr, "%s\n", usage_string);
        exit (1);
    }
    if (argc == 4)
    {
        if (strcmp (argv[1], "-idle") != 0)
        {
            fprintf (stderr, "%s\n", usage_string);
            exit (1);
        }
        if ( ( interval = atoi (argv[2]) ) < 1 )
        {
            fprintf (stderr, "Bad interval: \"%s\"\n", argv[2]);
            exit (2);
        }
        idle = TRUE;
        /*  Skip the option so that argv[1] is the size in both cases  */
        argv += 2;
    }
    num_mbytes = atoi (argv[1]);
    /*  Compute the size as a size_t to avoid unsigned int overflow  */
    if ( ( array = (unsigned long *) malloc ( (size_t) MBYTE * num_mbytes ) )
         == NULL )
    {
        fprintf (stderr, "Error allocating\n");
        exit (2);
    }
    num_words = num_mbytes * WORDS_PER_MBYTE;
    fprintf (stderr, "Number of words: %u\n", num_words);
    if (idle) idle_test (num_words, array, interval);
    else thrash_test (num_words, array);
    return 0;  /*  Not reached: both tests loop until an error or a signal  */
} /* End Function main */

/*  Repeatedly write the pattern and read it straight back, inverting the
    XOR word after every pass. Exits with status 3 on the first bad word.
*/
static void thrash_test (unsigned int num_words, unsigned long *array)
{
    unsigned int byte_count;
    unsigned int word_count;
    unsigned long xor_pattern = 0xab12cf9e;

    while (TRUE)
    {
        fprintf (stderr, "Writing\t");
        for (word_count = 0, byte_count = 0; word_count < num_words;
             ++word_count)
        {
            array[word_count] = word_count ^ xor_pattern;
            byte_count += sizeof (unsigned long);
            if (byte_count >= MBYTE)
            {
                /*  One dot of progress per MByte  */
                byte_count = 0;
                fprintf (stderr, ".");
            }
        }
        fprintf (stderr, "\nReading\t");
        for (word_count = 0, byte_count = 0; word_count < num_words;
             ++word_count)
        {
            if ( array[word_count] != (word_count ^ xor_pattern) )
            {
                fprintf (stderr, "Word[%u] = 0x%lx should be: 0x%lx\n",
                         word_count,
                         array[word_count], word_count ^ xor_pattern);
                exit (3);
            }
            byte_count += sizeof (unsigned long);
            if (byte_count >= MBYTE)
            {
                byte_count = 0;
                fprintf (stderr, ".");
            }
        }
        fprintf (stderr, "\n");
        xor_pattern = ~xor_pattern;
    }
} /* End Function thrash_test */

/*  Write the pattern once, then re-read it every <interval> seconds to
    catch bits that decay while idle. Exits with status 3 on the first
    bad word.
*/
static void idle_test (unsigned int num_words, unsigned long *array,
                       int interval)
{
    unsigned int byte_count;
    unsigned int word_count;
    unsigned long xor_pattern = 0xab12cf9e;

    fprintf (stderr, "Will write once, read many times with a %d s", interval);
    fprintf (stderr, " interval between scans\n");
    fprintf (stderr, "Writing\t");
    for (word_count = 0, byte_count = 0; word_count < num_words;
         ++word_count)
    {
        array[word_count] = word_count ^ xor_pattern;
        byte_count += sizeof (unsigned long);
        if (byte_count >= MBYTE)
        {
            /*  One dot of progress per MByte  */
            byte_count = 0;
            fprintf (stderr, ".");
        }
    }
    fprintf (stderr, "\n");
    while (TRUE)
    {
        fprintf (stderr, "Reading\t");
        for (word_count = 0, byte_count = 0; word_count < num_words;
             ++word_count)
        {
            if ( array[word_count] != (word_count ^ xor_pattern) )
            {
                fprintf (stderr, "Word[%u] = 0x%lx should be: 0x%lx\n",
                         word_count,
                         array[word_count], word_count ^ xor_pattern);
                exit (3);
            }
            byte_count += sizeof (unsigned long);
            if (byte_count >= MBYTE)
            {
                byte_count = 0;
                fprintf (stderr, ".");
            }
        }
        fprintf (stderr, "\n");
        sleep (interval);
    }
} /* End Function idle_test */