Re: Frequent oops in shrink_mmap

Stephen C. Tweedie (sct@redhat.com)
Tue, 30 Nov 1999 14:30:52 +0000 (GMT)


Hi,

On Mon, 29 Nov 1999 17:09:12 -0800, David desJardins <desj@google.com>
said:

> Thanks very much; this is really helpful.

> I looked at 56 of these oops messages in try_to_free_buffers, from 10
> machines. 50 messages (4 machines) have %eax=80000000, and 6 messages
> (6 machines) have %eax=40000000. Is this consistent with the single-bit
> memory error, or not?

Yep. If there is a design flaw on your motherboards, for example,
such that the timing of the outside signals on the bus is borderline,
then this is exactly what you would see.
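As a quick check of the "single-bit" reading of those two values, here
is a minimal userspace sketch in C. The helper names single_bit_index()
and flip_bit() are made up for illustration and are not kernel
functions. It also assumes that the reported %eax values are corrupted
copies of a word that should have been zero.

```c
#include <stdint.h>

/* Hypothetical helper: returns the index (0..31) of the only set bit
 * in v, or -1 if v has zero bits set or more than one bit set. */
static int single_bit_index(uint32_t v)
{
    int idx = 0;

    /* v & (v - 1) clears the lowest set bit; the result is zero
     * only when at most one bit was set to begin with. */
    if (v == 0 || (v & (v - 1)) != 0)
        return -1;

    while ((v >>= 1) != 0)
        idx++;
    return idx;
}

/* Hypothetical helper: models a one-bit memory error by toggling a
 * single bit (0..31) in a 32-bit word. */
static uint32_t flip_bit(uint32_t word, int bit)
{
    return word ^ (UINT32_C(1) << bit);
}
```

With these, single_bit_index(0x80000000) gives 31 and
single_bit_index(0x40000000) gives 30, so both observed values are one
flipped bit away from zero: flip_bit(0, 31) and flip_bit(0, 30)
reproduce them exactly.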

In cases like these it is always hard to determine absolutely for sure
whether it is hardware or software, but the fact that you see this on
only a fixed subset of otherwise identical machines argues strongly
for hardware.

There is also the fact that the corruptions are occurring in a statically
allocated area of memory which is simply never, ever used for kmalloc
allocations, whereas the vast, vast, vast majority of software-triggered
memory corruptions are due to accesses to memory which has been
allocated dynamically and then freed early. What you are seeing just
doesn't fit that pattern (and I can see no way a corruption to the
buffer ring links in a normal buffer head could get propagated back to
the static page array without triggering an oops elsewhere).

My bet would still be on hardware.

> If it's purely a hardware problem, should I be seeing 20000000 and
> 10000000 and other one-bit patterns?

Not if there is a design flaw in the memory, motherboards or chipset.

> And should I be seeing one-bit differences from valid nonzero
> pointers? Or is it the case that only memory errors in the top two
> bits will trigger this oops, and other memory errors might remain
> undetected?

Memory errors elsewhere in those words are still likely to generate
oopses, yes: any non-zero value in page->buffers is going to get
dereferenced in try_to_free_buffers, and will most likely cause an oops
in there somewhere (if not at the first dereference, then later when we
follow the chain of per-page buffer_heads).
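To illustrate why any non-zero value gets dereferenced, here is a
simplified userspace sketch in C of walking a page's ring of buffer
heads. The structures are stripped-down stand-ins rather than the real
kernel definitions, and count_page_buffers() is a hypothetical function,
not the actual try_to_free_buffers code. Only the shape of the walk is
meant to match the description above.

```c
#include <stddef.h>

/* Simplified stand-in: the real kernel struct has many more fields. */
struct buffer_head {
    struct buffer_head *b_this_page; /* next buffer on the same page (circular ring) */
};

/* Simplified stand-in for an entry in the static page array. */
struct page {
    struct buffer_head *buffers;     /* NULL when the page has no buffers */
};

/* Hypothetical walker: counts the buffer heads on the page's ring,
 * returning 0 when page->buffers is NULL. */
static int count_page_buffers(const struct page *page)
{
    const struct buffer_head *bh = page->buffers;
    const struct buffer_head *tmp = bh;
    int n = 0;

    /* Only an exact zero is treated as "no buffers". A corrupted
     * value such as 0x80000000 passes this test. */
    if (bh == NULL)
        return 0;

    do {
        n++;
        /* First dereference of the (possibly bogus) pointer; each
         * later step follows whatever it found there. */
        tmp = tmp->b_this_page;
    } while (tmp != bh);

    return n;
}
```

A page with three valid buffer heads linked in a ring gives a count of
3, and a page with buffers == NULL gives 0. If a bit flip turns that
NULL into any non-zero value, the walk reads through it as though it
were a real buffer_head. The fault then comes either at the first
tmp->b_this_page or further along the chain.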

--Stephen
