Re: 2.2.15pre17 oops --- find_buffer+104/144

From: John Alvord (jalvo@mbay.net)
Date: Tue Apr 11 2000 - 10:07:49 EST


On Tue, 11 Apr 2000, Stephen C. Tweedie wrote:

> Hi,
>
> On Mon, Apr 10, 2000 at 10:50:40AM +0200, Romano Giannetti wrote:
> >
> > Maybe a bit flip (cosmic rays?). The machine is rock solid, normally.
> > Pentium III, only ide, no strange thing running (I mean, no VMware).
> >
> > EIP: 0010:[find_buffer+104/144]
> > eax: 00008000 ebx: 00000005 ecx: 0000b863 edx: 00008000
>
> _Every_ time I have tried to trace a bit-flip oops like that in
> find_buffer, it has turned out to be hardware. find_buffer seems
> to be about the most sensitive place in the kernel to that sort of
> error: the buffer cache code walks its linked lists a lot more than
> most other places in the kernel do, in my experience.
>
I spent some time talking with an astrophysicist when I worked at
IBM Research. One of his experimental studies measured how many bit
flips DRAM would experience. This was 1983, and he was talking about
64-Kbit memory chips. A typical PC with 1 MB of memory would see about
one observed error per year at sea level. There would be several more
unobserved ones, since a write to the affected location overwrites the
flipped bit before it is ever read. In Denver (about a mile high) the
rate doubled, because there was less air in the way.
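
As a rough sketch of the arithmetic above: the only figures taken from
the account are one observed error per year for 1 MB at sea level, and
a doubling at Denver. The function name, its parameters, and the idea
that the rate scales linearly with memory size are assumptions made
purely for illustration, not measured results.

```python
# Baseline from the 1983 account: about one observed error per year
# for a PC with 1 MB of memory at sea level.
BASE_ERRORS_PER_MB_YEAR = 1.0


def expected_observed_errors(memory_mb, years, altitude_factor=1.0):
    """Estimate observed bit errors (hypothetical helper).

    ASSUMPTION: errors scale linearly with memory size and time.
    altitude_factor is 1.0 at sea level and 2.0 for Denver, per the
    account; other values would be guesses.
    """
    return BASE_ERRORS_PER_MB_YEAR * memory_mb * years * altitude_factor


sea_level = expected_observed_errors(memory_mb=1, years=1)
denver = expected_observed_errors(memory_mb=1, years=1, altitude_factor=2.0)
print(sea_level, denver)
```

Newer, denser chips need not have the same per-bit rate as 1983 parts,
so this says nothing firm about larger modern memories.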

He had overseen a great experiment correlating observations of cosmic
rays at a large radio telescope in Arizona with bit errors in some
machines that were installed in a Colorado ghost town.

He made a good point in favor of ECC memory, which can correct single bit
errors.
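
To illustrate how single-bit correction can work, here is a minimal
sketch using a Hamming(7,4) code. This is a textbook code chosen for
brevity, not what any particular memory controller implements; ECC
memory typically uses wider codes over 64-bit words. The function names
are hypothetical.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(codeword):
    """Return (repaired codeword, 1-based position of the flip, or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome names the flipped position
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return c, position


def decode(codeword):
    """Extract the 4 data bits from a (corrected) codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


word = [1, 0, 1, 1]
stored = encode(word)
stored[4] ^= 1  # simulate a single bit flip at position 5
repaired, where = correct(stored)
print(decode(repaired) == word, where)
```

A code like this repairs any single flipped bit in a codeword; two
flips in the same codeword would be miscorrected, which is why
practical schemes add an extra parity bit to detect double errors.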

These days, with increased memory density, the problem is probably worse.

john alvord



