Re: Crash report #1 with SADISTIC_KMALLOC/no modules

Linus Torvalds
Mon, 8 Apr 1996 09:20:21 +0300 (EET DST)

On Sun, 7 Apr 1996, Morten Welinder wrote:
> > Ok, so far so good. HOWEVER, that instruction then traps with:
> >
> > Unable to handle kernel paging request at virtual address 00000004
> >
> > even though virtual address 0x00000004 never even enters the picture. In
> > short, that particular instruction should under no circumstances be able
> > to trap with that address.
> This is not quite true. I can see at least two possibilities where
> the instruction could validly generate the page fault:
> 1. The page directory or some page table contains bogus information.
> This includes entries not flushed correctly.

Not possible, unless the chip in question does other things wrong. The
%cr2 register contains the virtual faulting address, and all the page
table lookups on an x86 are physical (and thus the page table lookups can
never fault, as they can on some other architectures).
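
To illustrate the point, here is a rough sketch of a two-level x86 page
table walk (4 kB pages only). It is a toy model, not kernel code: the names
translate, PageFault and phys_mem are made up for illustration, and it
ignores the TLB, protection bits and large pages. Every lookup during the
walk uses a physical address, so the only address that can ever be
reported in %cr2 is the linear address that was being accessed.

```python
PAGE_PRESENT = 0x1          # bit 0 of a directory/table entry
FRAME_MASK = 0xFFFFF000     # upper 20 bits: physical frame address


class PageFault(Exception):
    """Hypothetical stand-in for the CPU raising a page fault."""

    def __init__(self, cr2):
        super().__init__(f"page fault at virtual address {cr2:08x}")
        self.cr2 = cr2      # always the faulting *linear* address


def translate(phys_mem, cr3, linear):
    """Walk the page tables; phys_mem maps physical address -> 32-bit word."""
    dir_index = (linear >> 22) & 0x3FF
    table_index = (linear >> 12) & 0x3FF
    offset = linear & 0xFFF

    # Both reads below are physical: they cannot themselves page-fault.
    pde = phys_mem.get((cr3 & FRAME_MASK) + 4 * dir_index, 0)
    if not (pde & PAGE_PRESENT):
        raise PageFault(linear)

    pte = phys_mem.get((pde & FRAME_MASK) + 4 * table_index, 0)
    if not (pte & PAGE_PRESENT):
        raise PageFault(linear)

    return (pte & FRAME_MASK) | offset


if __name__ == "__main__":
    # Assumed layout: directory at 0x1000, one page table at 0x2000.
    mem = {0x1000: 0x2000 | PAGE_PRESENT, 0x2004: 0x5000 | PAGE_PRESENT}
    print(hex(translate(mem, 0x1000, 0x00001234)))
    try:
        translate(mem, 0x1000, 0x00000004)
    except PageFault as fault:
        print(fault)
```

In this model a bogus entry can make the walk fault or land on the wrong
page, but the address reported is still the one the instruction used.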

> 2. A non-sequential event (i.e., an interrupt) hides the real cause.

I thought about this, but if the 00000004 value comes from an interrupt
also taking a page fault then that other fault information should _also_
have come from that interrupt, but the stack dump gives us the rep movsl.

Also, interrupts should never cause a page-fault in the first place - if
they did you'd see those page faults a lot more often than you see the
occasional incorrect fault (essentially, a page fault in an interrupt
should always result in some kernel message, and to get this kind of
corruption you'd have to have the interrupt at exactly the right place).

Finally, if this was due to an interrupt or anything like that, it would
be less deterministic, I believe. But he had two faults (faulting in
completely different places) that showed the exact same behaviour.

> 3. Something trashes [part of] the register dump before it gets
> dumped.

This is certainly possible, but rather unlikely: the dump information all
looked very sane other than the incorrect fault information. I find it
unlikely that we would get such localized corruption (and only very
occasionally: the fault handler has obviously worked correctly for probably
millions of page faults, or he would never have been able to get into user
land and X in the first place).

> If I remember things right then CR2 contains a physical address, not
> a linear one. Something could also have gone wrong with the
> translation before the printing.

No, cr2 is the faulting virtual address, so that's not it. I agree that
the weird behaviour could potentially be from something other than a buggy
CPU, but considering that the machine in question has been showing
symptoms that others don't see for the whole 1.3.x series, I'm now
strongly suspecting hardware (I'm always suspicious of hardware, but I
want to make sure a linux bug can't possibly be the more likely cause).

Does somebody else with Cyrix CPUs see strange behaviour like this?