Re: .96/.97 : random crashes (detailed Oops report)

Linus Torvalds (torvalds@cs.helsinki.fi)
Mon, 6 May 1996 16:00:39 +0300 (EET DST)


> I have been getting random crashes lately (some with Oops info, some
> plain reboots). A bad external cache might have interfered, too... :-(
>
> Here is what I was able to grab from the error log :
>
> Unable to handle kernel NULL pointer dereference at virtual address
> c000000008] current->tss.cr3 = 00333000, %cr3 = 00333000

The above looks like a cut-and-paste error: I assume that the faulting address
was 0xc0000000 and that the "8]" is some corruption. (The "NULL pointer" message
is only printed for addresses 0xc0000000-0xc0000fff, so regardless of the
corruption the rest of this panic looks strange..)
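
For illustration, the address-range test described above can be sketched as
follows. This is a minimal sketch, not the kernel's actual fault handler: it
assumes an i386 kernel whose address space starts at 0xc0000000 with 4096-byte
pages, and the names KERNEL_BASE and reported_as_null_pointer() are made up
for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for this sketch: kernel addresses start at 0xc0000000
 * and a page is 4096 bytes. Both names below are hypothetical. */
#define KERNEL_BASE 0xc0000000u
#define PAGE_SIZE   4096u

/* True if a kernel-mode fault at `address` lies in the first page above
 * KERNEL_BASE, which the kernel itself sees as addresses 0..0xfff. */
static bool reported_as_null_pointer(uint32_t address)
{
    /* Unsigned subtraction wraps for addresses below KERNEL_BASE,
     * so a single comparison checks both ends of the range. */
    return (uint32_t)(address - KERNEL_BASE) < PAGE_SIZE;
}
```

With these assumptions, 0xc0000000 and 0xc0000fff fall inside the "NULL
pointer" range, while 0xc0001000 and an ordinary stack address such as
0x0117df30 do not.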

> *pde = 00102067
> *pte = 00000027
> Oops: 0002
> CPU: 0
> EIP: 0010:[<00120290>]
> EFLAGS: 00010046
> eax: 00000000 ebx: 001e3778 ecx: 00000000 edx: 00000296
> esi: 00000001 edi: 001bc514 ebp: 00000000 esp: 0117df1c
> ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
> Process cpp (pid: 945, process nr: 38, stackpage=0117d000)
> Stack: 00f58618 40036d70 0117dfbc ffff0007 001bc514 00000000 00000293
>        00119952 00000003 00000000 00000000 00f58618 40036d70 0117dfbc
>        ffff0007 00f58618 40036d70 0117dfbc ffff0004 001e37ac 00000000
>        0011cba0 00a33025 00105025
> Call Trace: [<00119952>] [<0011cba0>] [<0010ff3e>] [<0010fe00>] [<0010a79b>]
> Code: 89 54 24 14 8b 43 10 8b 4c 24 10 89 41 10 89 48 20 8d 45 01
>
> >>EIP: 120290 <__get_free_pages+c0/1c0>
> Trace: 119952 <do_wp_page+12/2b0>
> Trace: 11cba0 <filemap_nopage>
> Trace: 10ff3e <do_page_fault+13e/2d0>
> Trace: 10fe00 <do_page_fault>
> Trace: 10a79b <error_code+4b/60>
>
> Code: 120290 <__get_free_pages+c0/1c0> movl %edx,0x14(%esp,1)
> Code: 120294 <__get_free_pages+c4/1c0> movl 0x10(%ebx),%eax
> Code: 120297 <__get_free_pages+c7/1c0> movl 0x10(%esp,1),%ecx
> Code: 12029b <__get_free_pages+cb/1c0> movl %eax,0x10(%ecx)
> Code: 12029e <__get_free_pages+ce/1c0> movl %ecx,0x20(%eax)
> Code: 1202a1 <__get_free_pages+d1/1c0> leal 0x1(%ebp),%eax
>
> This is around page_alloc.c:146...

Yes. It's also another of those impossible things.. The above instruction
should _not_ result in a NULL pointer error, as esp is a perfectly fine kernel
pointer (and it's in the kernel stack page, so it looks fine: 0x0117df1c). What
kind of CPU do you have? Could it be the same old Cyrix bug in a new disguise?
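
The stack-pointer sanity check above can be worked through with the numbers
from the report (esp = 0x0117df1c, stackpage = 0x0117d000, and the faulting
store to 0x14(%esp)). This is only a sketch: it assumes the kernel stack is a
single 4096-byte page starting at stackpage, and esp_operand() and
in_stack_page() are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the kernel stack occupies one 4096-byte page. */
#define PAGE_SIZE 4096u

/* Effective address of a "disp(%esp)" memory operand. */
static uint32_t esp_operand(uint32_t esp, uint32_t disp)
{
    return esp + disp;
}

/* True if `addr` lies inside the one-page stack starting at `stackpage`. */
static bool in_stack_page(uint32_t stackpage, uint32_t addr)
{
    /* Wrapping unsigned subtraction rejects addresses below stackpage. */
    return (uint32_t)(addr - stackpage) < PAGE_SIZE;
}
```

For the values in the report, esp_operand(0x0117df1c, 0x14) gives 0x0117df30,
which is at offset 0xf30 within the stack page, so the store target is nowhere
near the 0xc0000000 page that the error message names.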

The panic report looks otherwise sane: the above instruction sequence is indeed
a normal part of __get_free_pages(), and the stack trace also makes sense, as
do the registers in the register dump (the faulting instruction is essentially
the "unsigned long mapnr = ret->mapnr;" assignment, and "mapnr = %edx = 0x296"
makes sense: it means it has found a free page at around the 2MB mark..)
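
The "2MB mark" estimate follows from simple arithmetic on the page number.
The sketch below assumes that mapnr is a page frame index counted from
physical address 0 and that pages are 4096 bytes (a shift of 12);
mapnr_to_phys() is a hypothetical helper name, not a kernel function.

```c
#include <stdint.h>

/* Assumption: 4 KB pages, so a page number maps to an address by << 12. */
#define PAGE_SHIFT 12

/* Physical address of the first byte of page number `mapnr`. */
static uint32_t mapnr_to_phys(uint32_t mapnr)
{
    return mapnr << PAGE_SHIFT;
}
```

Under those assumptions, mapnr 0x296 corresponds to address 0x296000, which
is 2711552 bytes, or roughly 2.6 MB into memory.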

Ho humm.. Any idea of what could have been going on when this happened? Any
pattern at all to it? The above doesn't look like a cache corruption problem,
unless your icache might have gotten corrupted some way.

Linus