How To Debug Kernel Oops's

Jordan Mendelson (jordy@wserv.com)
Mon, 12 Apr 1999 20:06:50 -0400


Hello all,

I've searched the mailing list archives and various web sites, but I
couldn't find a lot of good information, so I've decided to post. I've
read that Linus and others won't even bother reading Oops's unless they
are run through ksymoops or the equivilent, however it's fairly apparent
that even if you do this and post to linux-kernel, there isn't much
chance of anyone responding or helping you out... so I'd like to get
some information on debugging an Oops myself.

As far as I can tell there are several paths I can go. I hooked up a
serial console so I could do some debugging. I understand I can use
gdbstub from http://www.gcom.com to do remote debugging via the serial
console and get basically the same functionality as gdb.

However, I'm a bit unclear on how to go about debugging a typical kernel
Oops. Here are the two Oopses I've recently gotten with 2.2.5 on an SMP
box:

Unable to handle kernel NULL pointer dereference at virtual address
00000018
current->tss.cr3 = 0febc000, %cr3 = 0febc000
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<c011e913>]
EFLAGS: 00010202
eax: 00001240 ebx: 000e9000 ecx: c0229160 edx: 00000010
esi: c2f65ba0 edi: c99cf000 ebp: 000ea000 esp: cf26df18
ds: 0018 es: 0018 ss: 0018
Process squid (pid: 400, process nr: 20, stackpage=cf26d000)
Stack: c03e0380 000ea000 c0229160 c011ede2 c20ad8a0 000ea000 00000000
00001000
0c9385d8 00000000 cf26c000 c022915c 00001000 000e9000 00003000
00012000
00000000 00000001 00000001 000e9000 c2f65ba0 c011f197 c20ad8a0
c20ad8b4
Call Trace: [<c011ede2>] [<c011f197>] [<c011f0e4>] [<c01273aa>]
[<c0108be8>]
Code: 39 72 08 75 f4 39 6a 0c 75 ef f0 ff 42 14 b8 02 00 00 00 f0

>>EIP; c011e913 <try_to_read_ahead+7b/11c>
Trace; c011ede2 <do_generic_file_read+2fa/5fc>
Trace; c011f197 <generic_file_read+63/7c>
Trace; c011f0e4 <file_read_actor+0/50>
Trace; c01273aa <sys_read+c2/e8>
Trace; c0108be8 <system_call+34/38>
Code; c011e913 <try_to_read_ahead+7b/11c> 00000000 <_EIP>:
Code; c011e913 <try_to_read_ahead+7b/11c> 0: 39 72 08
cmpl %esi,0x8(%edx)
Code; c011e916 <try_to_read_ahead+7e/11c> 3: 75 f4
jne fffffff9 <_EIP+0xfffffff9> c011e90c <try_to_read_ahead+74/11c>
Code; c011e918 <try_to_read_ahead+80/11c> 5: 39 6a 0c
cmpl %ebp,0xc(%edx)
Code; c011e91b <try_to_read_ahead+83/11c> 8: 75 ef
jne fffffff9 <_EIP+0xfffffff9> c011e90c <try_to_read_ahead+74/11c>
Code; c011e91d <try_to_read_ahead+85/11c> a: f0 ff 42 14
lock incl 0x14(%edx)
Code; c011e921 <try_to_read_ahead+89/11c> e: b8 02 00 00 00
movl $0x2,%eax
Code; c011e926 <try_to_read_ahead+8e/11c> 13: f0 00 00
lock addb %al,(%eax)

and

Unable to handle kernel NULL pointer dereference at virtual address
00000018
current->tss.cr3 = 01672000, %cr3 = 01672000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c011e7e2>]
EFLAGS: 00010202
eax: c03df908 ebx: c8873760 ecx: 00000010 edx: 0001910e
esi: 00000000 edi: 40009000 ebp: 00000400 esp: c8d6bea0
ds: 0018 es: 0018 ss: 0018
Process webalizer (pid: 4519, process nr: 30, stackpage=c8d6b000)
Stack: 40009000 c8873760 00000010 0c887376 c013e594 c8873760 0000c000
c7cffc00
00000400 c65c1560 ffffffea c88737ac c8d6a000 c632e080 00000400
0000c000
00000000 c8d6bf08 00000212 c4090000 00000000 00000000 c0fd6a00
00000000
Call Trace: [<c013e594>] [<c01249e5>] [<c011c34a>] [<c011dca9>]
[<c011df52>] [<c01274c1>] [<c013e2e8>]
[<c0108be8>]
Code: 39 59 08 75 e1 8b 5c 24 20 39 59 0c 75 d8 f0 ff 41 14 b8 02

>>EIP; c011e7e2 <update_vm_cache+7a/130>
Trace; c013e594 <ext2_file_write+2ac/4f8>
Trace; c01249e5 <free_page_and_swap_cache+5d/60>
Trace; c011c34a <zap_page_range+13e/1bc>
Trace; c011dca9 <unmap_fixup+109/118>
Trace; c011df52 <do_munmap+212/228>
Trace; c01274c1 <sys_write+f1/124>
Trace; c013e2e8 <ext2_file_write+0/4f8>
Trace; c0108be8 <system_call+34/38>
Code; c011e7e2 <update_vm_cache+7a/130> 00000000 <_EIP>:
Code; c011e7e2 <update_vm_cache+7a/130> 0: 39 59 08
cmpl %ebx,0x8(%ecx)
Code; c011e7e5 <update_vm_cache+7d/130> 3: 75 e1
jne ffffffe6 <_EIP+0xffffffe6> c011e7c8 <update_vm_cache+60/130>
Code; c011e7e7 <update_vm_cache+7f/130> 5: 8b 5c 24 20
movl 0x20(%esp,1),%ebx
Code; c011e7eb <update_vm_cache+83/130> 9: 39 59 0c
cmpl %ebx,0xc(%ecx)
Code; c011e7ee <update_vm_cache+86/130> c: 75 d8
jne ffffffe6 <_EIP+0xffffffe6> c011e7c8 <update_vm_cache+60/130>
Code; c011e7f0 <update_vm_cache+88/130> e: f0 ff 41 14
lock incl 0x14(%ecx)
Code; c011e7f4 <update_vm_cache+8c/130> 12: b8 02 00 00 00
movl $0x2,%eax

Now, the trace seems self-explanatory. I need to brush up on my ASM
(more like relearn it it seems) but for the most part I understand the
code part as well. Is the trace and code sections supposed to be read
from bottom to top or top to bottom? Gdb backtrace is bottom to top, but
the code section looks like top to bottom.

I'm still not clear on what the numbers next to the function name are, I
assumed they were line numbers... but I checked the source and they
don't seem to corrispond.

Typically, given the above kernel oops, how would one go about finding
the bug. Open up each function in the trace and see if there are any
blatently obvious problems and if not stick a printk() in it and wait
for an oops?

Are there any good books on debugging the Linux kernel. With an average
program it's easy, wait for a seg fault or insert some breakpoints
something is goign wrong, and run the program then get a backtrace with
complete pointers to the source including line numbers and function
arguments.

Do any other Unices have any more advanced crash analysis techniques?
I've read something about a kernel core file for some OS's.

Jordan

--
Jordan Mendelson     : http://jordy.wserv.com
Web Services, Inc.   : http://www.wserv.com

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/