Re: 2.2.10 oops (finally, something I can report!)

Steven N. Hirsch (shirsch@adelphia.net)
Thu, 1 Jul 1999 07:55:26 -0400 (EDT)


On Wed, 30 Jun 1999, Nate Eldredge wrote:

> Linus wrote:
> >
> > The thing that does NOT make sense is the cause of the oops itself,
> > though.
> >
> > The oops happens on
> >
> > c017b651 pushl %ebx
> >
> > and %esp = c3941e80.
> >
> > And quite frankly, there's not a way in h*ll that that instruction could
> > raise the exception in question. But it does.
> >
> > I would _strongly_ suspect one of two things:
> > - bad CPU.
> > - bad cache or RAM timings.
>
> I had a Cyrix CPU some time back that had a *very* similar problem. I
> believe it was running 2.0.36. Anyway, it worked absolutely fine, until
> one day I built EGCS. This binary would, about 1/3 of the time, crash.
> Poking around with a debugger showed that the instruction on which it
> crashed was an access to a perfectly valid address (according to
> /proc/xx/maps). Swapping in a different CPU (I think it was an Intel
> Pentium) fixed it. ISTR it also could be fixed by turning off the L1
> cache or something equally unacceptable performance-wise.

I'll provide another data point on this issue. For about two years, my
Cyrix P150+ box would crash 1 out of 3 times during kernel builds with
spurious signal 11s (SIGSEGV). No rhyme or reason: the failing location
was different every time.

Finally, after a suggestion from Alan Cox, I picked up a Pentium 166 and
replaced the CPU. Haven't seen so much as a hiccup from the box since
then (about 8 months now).

This is almost certainly a hardware problem.

Steve
