Re: Buffer corruption (2.1.81)

Joseph H. Buehler (jhpb@sarto.gaithersburg.md.us)
26 Jan 1998 08:00:10 -0500


Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de> writes:

> When compiling a kernel, it reported an error in a file that compiled
> fine twenty minutes ago. When looking at the source, I see that 4 bytes
> are magically changed, so I suspect the buffers were corrupted. The
> corrupted word was at offset 13300 of the file. I'm running
> 2.1.81+linux-2.1.81.diff.gz. There were no syslog reports.
>
> It might be a hardware problem, I did not run any analysis tool yet.
> The machine is an AMD 486 with 40MB memory, an Adaptec 2940 controller
> and a Teles/16.2.
>
> If someone can recommend a procedure to analyse such problems if they
> occur again, please let me know. I had binaries suddenly crash with
> earlier kernel versions as well. After rebooting, the files were in
> their original state.

I have seen the same thing under Red Hat 5.0 on a dual PPro. I was
trying to compile egcs, and the build kept crashing with various fatal
signals, usually in the linker but every now and then in the
compiler. When it was a compile error, I looked at the source and the
file was corrupted. But the corruption did not appear to be on disk;
under circumstances that I don't recall, rereading the file in emacs
gave me back the correct contents.

I upgraded the kernel using a 2.0.33-5 source rpm residing on
redhat.com; it fixed nothing. I tried commenting out SMP=1 and
running a uniprocessor kernel; that also changed nothing.

I started to suspect SCSI problems; I tried the egcs compile on my
root disk instead of my second disk, and it worked on the first
attempt.

So I wrote a simple disk exerciser to write 512-byte blocks of
pseudo-random data at pseudo-random block offsets in a 900 MB file.
It took 10 hours to write the file; the program has been reading it
all back and checking the block contents for about 13 hours now,
without errors. It doesn't look like a SCSI termination problem.
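
For illustration only, a minimal sketch of an exerciser along these
lines. It is not the actual program: the function names, the seeding
scheme, and the choice to visit every block exactly once in shuffled
order are all assumptions made for the example.

```python
import os
import random

BLOCK_SIZE = 512  # bytes per block, as in the description above


def block_data(seed, block_no):
    """Deterministic pseudo-random contents for one block (hypothetical scheme)."""
    rng = random.Random(seed * 1_000_003 + block_no)
    return bytes(rng.getrandbits(8) for _ in range(BLOCK_SIZE))


def write_blocks(path, n_blocks, seed):
    """Write every block once, visiting block offsets in pseudo-random order."""
    order = list(range(n_blocks))
    random.Random(seed).shuffle(order)
    with open(path, "wb") as f:
        f.truncate(n_blocks * BLOCK_SIZE)  # size the file up front
        for block_no in order:
            f.seek(block_no * BLOCK_SIZE)
            f.write(block_data(seed, block_no))
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data to disk


def verify_blocks(path, n_blocks, seed):
    """Re-read all blocks in a different order; return the mismatching block numbers."""
    order = list(range(n_blocks))
    random.Random(seed + 1).shuffle(order)
    bad = []
    with open(path, "rb") as f:
        for block_no in order:
            f.seek(block_no * BLOCK_SIZE)
            if f.read(BLOCK_SIZE) != block_data(seed, block_no):
                bad.append(block_no)
    return sorted(bad)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "exercise.dat")
        write_blocks(path, n_blocks=2048, seed=1998)
        print("bad blocks:", verify_blocks(path, n_blocks=2048, seed=1998))
```

Because each block's contents are regenerated from the seed and block
number, no reference copy has to be kept. Note that a read-back soon
after writing may be served from the buffer cache rather than the
disk unless the file is much larger than memory.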

I will probably do a file/directory create/delete test next, and see
if that causes problems. Sure looks like a kernel bug of some kind...
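
A file/directory create/delete test of that kind could be sketched
roughly as follows. This is a guess at the general shape, not a
program that exists yet: the names churn and payload, the tree
layout, and the file sizes are all hypothetical.

```python
import hashlib
import os
import shutil


def payload(i, size=4096):
    """Deterministic file contents derived from the index i."""
    out = b""
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{i}:{counter}".encode()).digest()
        counter += 1
    return out[:size]


def churn(root, rounds, dirs_per_round=8, files_per_dir=16):
    """Create, verify and delete a tree of small files; return mismatched paths."""
    bad = []
    for r in range(rounds):
        top = os.path.join(root, f"round{r}")
        # Create phase: a fresh set of directories full of small files.
        for d in range(dirs_per_round):
            sub = os.path.join(top, f"d{d}")
            os.makedirs(sub)
            for i in range(files_per_dir):
                with open(os.path.join(sub, f"f{i}"), "wb") as f:
                    f.write(payload(r * 1000003 + d * 1009 + i))
        # Verify phase: read everything back and compare.
        for d in range(dirs_per_round):
            sub = os.path.join(top, f"d{d}")
            for i in range(files_per_dir):
                path = os.path.join(sub, f"f{i}")
                with open(path, "rb") as f:
                    if f.read() != payload(r * 1000003 + d * 1009 + i):
                        bad.append(path)
        # Delete phase: remove the whole tree before the next round.
        shutil.rmtree(top)
    return bad
```

Calling churn("/some/scratch/dir", rounds=1000) on the suspect disk
and checking that the returned list is empty would be one way to use
it.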

Joe Buehler