Re: .96/.97 : random crashes (detailed Oops report)

Paul Matthews (paul@matthews.com)
Sun, 12 May 1996 15:11:12 -0400


All,

FWIW, I have a similar problem. It occurred with linux-1.3.100 and
with linux-pre2.0.1.

I have never seen this problem before. I find the following:

1. The system proceeds through the regular boot-up sequence until it gets
beyond installing the IDE drivers (I think). Below is the regular
boot-up sequence with linux-1.3.98; as far as I can tell, it is
very similar with the linux-pre2.0.1 release:

+++++++++++++++++++++++++++++++++++++++++++++++++++
ide_setup: ide0=dtc2278
Console: 16 point font, 400 scans
Console: colour VGA+ 80x25, 1 virtual console (max 63)
Calibrating delay loop.. ok - 39.94 BogoMIPS
Memory: 31200k/32768k available (500k kernel code, 384k reserved, 684k data)
This processor honours the WP bit even when in supervisor mode. Good.
Swansea University Computer Society NET3.034 for Linux 1.3.77
NET3: Unix domain sockets 0.12 for Linux NET3.033.
Swansea University Computer Society TCP/IP for NET3.034
IP Protocols: IGMP, ICMP, UDP, TCP
Checking 386/387 coupling... Ok, fpu using exception 16 error reporting.
Checking 'hlt' instruction... Ok.
Linux version 1.3.98 (root@gw1) (gcc version 2.7.2) #3 Fri May 10 19:57:04 EDT 1996
Serial driver version 4.12 with no serial options enabled
tty00 at 0x03f8 (irq = 4) is a 16550A
tty01 at 0x02f8 (irq = 3) is a 16550A
tty02 at 0x03e8 (irq = 4) is a 16450
Real Time Clock Driver v1.05
hda: Maxtor 7540 AV, 515MB w/32kB Cache, LBA, CHS=1046/16/63
hdb: Maxtor 7850 AV, 814MB w/64kB Cache, LBA, CHS=1654/16/63
hdc: FX400_02, ATAPI CDROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
++++++++++++++++++++++++++++++++++++++++++++++++++

[NOTE: It never gets to the next line under the pre2.0.1 release.]

ide1 at 0x170-0x177,0x376 on irq 15 (serialized with ide0)
smc-ultra.c:v1.12 1/18/95 Donald Becker (becker@cesdis.gsfc.nasa.gov)
eth0: SMC Ultra at 0x300, 00 00 C0 89 15 8F, IRQ 11 memory 0xd0000-0xd3fff.
++++++++++++++++++++++++++++++++++++++++++++++++++

[Misc. lines omitted.]
Then, I get the following messages:
++++++++++++++++++++++++++++++++++++++++++++++++++

Unable to handle kernel NULL pointer
dereference at virtual address c000000d6
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00102067
*pte = 00000027
Oops: 0000
CPU: 0
++++++++++++++++++++++++++++++++++++++++++++++++++

[Misc. lines omitted.]
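
For anyone reading the dump above, the Oops code and the *pde / *pte
values can be decoded by hand. The following is a minimal Python sketch,
assuming the standard i386 page-fault error-code bits and page-table
entry flag bits. The function names decode_oops_code and decode_entry
are made up for illustration and are not part of any kernel tool.

```python
# Sketch only: decode i386 page-fault error codes and page-table entries.
# Function names are hypothetical; bit layouts are the standard i386 ones.

def decode_oops_code(code):
    """Split the low three bits of an i386 page-fault error code."""
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }

# Flag bits used in i386 page-directory and page-table entries.
# (Bit 6, "dirty", is only meaningful to the CPU in a page-table entry.)
ENTRY_FLAGS = [
    (0x01, "present"),
    (0x02, "writable"),
    (0x04, "user"),
    (0x20, "accessed"),
    (0x40, "dirty"),
]

def decode_entry(value):
    """Return (frame address, list of flag names) for a *pde or *pte value."""
    frame = value & 0xFFFFF000
    flags = [name for bit, name in ENTRY_FLAGS if value & bit]
    return frame, flags

if __name__ == "__main__":
    print(decode_oops_code(0x0000))
    for label, value in (("pde", 0x00102067), ("pte", 0x00000027)):
        frame, flags = decode_entry(value)
        print("%s: frame=%08x flags=%s" % (label, frame, flags))
```

Read this way, "Oops: 0000" describes a kernel-mode read of a page
that the CPU reported as not present.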

I am running an AMD 486-DX2/80 chip with 32-MB RAM. This
system has never had problems like this before. I do not have a
bad external cache, as far as I can tell. It is interesting that the
addresses above are very similar to the ones reported below.

This problem is consistent and repeatable. Is there anything else
that I can report that might be helpful?

Regards,
Paul Matthews
McLean, VA
e-mail: paul@matthews.com
++++++++++++++++++++++++++++++++++++++++++++++++++

Linus Torvalds writes:
> > I have been getting random crashes lately (some with Oops info, some
> > plain reboot. Bad external cache might have interfered, too... :-( ).
> >
> > Here is what I was able to grab from the error log :
> >
> > Unable to handle kernel NULL pointer dereference at virtual address
> > c000000008] current->tss.cr3 = 00333000, %cr3 = 00333000
>
> The above looks like a cut-and-paste error, I assume that the faulting address
> was 0xc0000000 and that the "8]" is some corruption (the "NULL pointer" thing
> only happens for addresses 0xc0000000-0xc0000fff, so regardless of the
> corruption the rest of this panic looks strange..)
>
> > *pde = 00102067
> > *pte = 00000027
> > Oops: 0002
> > CPU: 0
> > EIP: 0010:[<00120290>]
> > EFLAGS: 00010046
> > eax: 00000000 ebx: 001e3778 ecx: 00000000 edx: 00000296
> > esi: 00000001 edi: 001bc514 ebp: 00000000 esp: 0117df1c
> > ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
> > Process cpp (pid: 945, process nr: 38, stackpage=0117d000)
> > Stack: 00f58618 40036d70 0117dfbc ffff0007 001bc514 00000000 00000293
> > 00119952 00000003 00000000 00000000 00f58618 40036d70 0117dfbc
> > ffff0007 00f58618 40036d70 0117dfbc ffff0004 001e37ac 00000000
> > 0011cba0 00a33025 00105025
> > Call Trace: [<00119952>] [<0011cba0>] [<0010ff3e>] [<0010fe00>] [<0010a79b>]
> > Code: 89 54 24 14 8b 43 10 8b 4c 24 10 89 41 10 89 48 20 8d 45 01
> >
> > >>EIP: 120290 <__get_free_pages+c0/1c0>
> > Trace: 119952 <do_wp_page+12/2b0>
> > Trace: 11cba0 <filemap_nopage>
> > Trace: 10ff3e <do_page_fault+13e/2d0>
> > Trace: 10ff3e <do_page_fault+13e/2d0>
> > Trace: 10a79b <error_code+4b/60>
> >
> > Code: 120290 <__get_free_pages+c0/1c0> movl %edx,0x14(%esp,1)
> > Code: 120294 <__get_free_pages+c4/1c0> movl 0x10(%ebx),%eax
> > Code: 120297 <__get_free_pages+c7/1c0> movl 0x10(%esp,1),%ecx
> > Code: 12029b <__get_free_pages+cb/1c0> movl %eax,0x10(%ecx)
> > Code: 12029e <__get_free_pages+ce/1c0> movl %ecx,0x20(%eax)
> > Code: 1202a1 <__get_free_pages+d1/1c0> leal 0x1(%ebp),%eax
> >
> > This is ca. page_alloc.c:146...
>
> Yes. It's also another of those impossible things.. The above instruction
> should _not_ result in a NULL pointer error, as esp is a perfectly fine kernel
> pointer (and it's in the kernel stack page, so it looks fine: 0x0117df1c). What
> kind of CPU do you have? Could it be the same old Cyrix bug in a new disguise?
>
> The panic report looks otherwise sane: the above instruction sequence is indeed
> a normal part of __get_free_pages(), and the stack trace also makes sense, as
> do the registers in the register dump (faulting instruction is essentially the
> "unsigned long mapnr = ret->mapnr;" assignment, and "mapnr = %edx = 296" makes
> sense: it means it has found a free page at around the 2MB mark..)
>
> Ho humm.. Any idea of what could have been going on when this happened? Any
> pattern at all to it? The above doesn't look like a cache corruption problem,
> unless your icache might have gotten corrupted some way.
>
> Linus
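
The two address checks in the reply above (the 0xc0000000-0xc0000fff
"NULL pointer" window, and esp lying inside the kernel stack page) can
be redone mechanically. A minimal Python sketch, assuming 4 kB pages;
the helper names are invented here for illustration only.

```python
# Sketch only: re-check the address reasoning from the quoted Oops report.
# Assumes 4 kB pages; helper names are hypothetical.

PAGE_SIZE = 0x1000
NULL_WINDOW_BASE = 0xC0000000  # start of the range quoted in the reply


def in_null_pointer_window(addr):
    """True if addr is in 0xc0000000-0xc0000fff, the range reported as NULL."""
    return NULL_WINDOW_BASE <= addr < NULL_WINDOW_BASE + PAGE_SIZE


def in_stack_page(addr, stackpage):
    """True if addr lies inside the 4 kB page starting at stackpage."""
    return stackpage <= addr < stackpage + PAGE_SIZE


# Values copied from the register dump in the quoted report.
ESP = 0x0117DF1C
STACKPAGE = 0x0117D000

# The faulting instruction is: movl %edx,0x14(%esp,1)
store_target = ESP + 0x14

if __name__ == "__main__":
    print("store target: %08x" % store_target)
    print("inside stack page:", in_stack_page(store_target, STACKPAGE))
    print("in NULL window:", in_null_pointer_window(store_target))
```

Under these assumptions the store target 0117df30 sits inside
stackpage 0117d000 and nowhere near the NULL window, which matches the
remark that this instruction should not produce a NULL pointer error.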