Re: How To Debug Kernel Oops's

Keith Owens (kaos@ocs.com.au)
Tue, 13 Apr 1999 14:24:07 +1000

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Mike Galbraith: "Re: [real fix] Re: [patch] fix for cc1 Out of memory [Re:"
Previous message: Linus Torvalds: "Re: [real fix] Re: [patch] fix for cc1 Out of memory [Re: [TESTCASE] `Out of memory for cc1']"
Maybe in reply to: Jordan Mendelson: "How To Debug Kernel Oops's"

On Mon, 12 Apr 1999 20:06:50 -0400,
Jordan Mendelson <jordy@wserv.com> wrote:
>Call Trace: [<c013e594>] [<c01249e5>] [<c011c34a>] [<c011dca9>]
>[<c011df52>] [<c01274c1>] [<c013e2e8>]
> [<c0108be8>]
>Code: 39 59 08 75 e1 8b 5c 24 20 39 59 0c 75 d8 f0 ff 41 14 b8 02
>
>>>EIP; c011e7e2 <update_vm_cache+7a/130>
>Trace; c013e594 <ext2_file_write+2ac/4f8>
>Trace; c01249e5 <free_page_and_swap_cache+5d/60>
>Trace; c011c34a <zap_page_range+13e/1bc>
>Trace; c011dca9 <unmap_fixup+109/118>
>Trace; c011df52 <do_munmap+212/228>
>Trace; c01274c1 <sys_write+f1/124>
>Trace; c013e2e8 <ext2_file_write+0/4f8>
>Trace; c0108be8 <system_call+34/38>
>Code; c011e7e2 <update_vm_cache+7a/130> 00000000 <_EIP>:
>Code; c011e7e2 <update_vm_cache+7a/130> 0: 39 59 08
>cmpl %ebx,0x8(%ecx)
>Code; c011e7e5 <update_vm_cache+7d/130> 3: 75 e1
>jne ffffffe6 <_EIP+0xffffffe6> c011e7c8 <update_vm_cache+60/130>
>Code; c011e7e7 <update_vm_cache+7f/130> 5: 8b 5c 24 20
>movl 0x20(%esp,1),%ebx
>Code; c011e7eb <update_vm_cache+83/130> 9: 39 59 0c
>cmpl %ebx,0xc(%ecx)
>Code; c011e7ee <update_vm_cache+86/130> c: 75 d8
>jne ffffffe6 <_EIP+0xffffffe6> c011e7c8 <update_vm_cache+60/130>
>Code; c011e7f0 <update_vm_cache+88/130> e: f0 ff 41 14
>lock incl 0x14(%ecx)
>Code; c011e7f4 <update_vm_cache+8c/130> 12: b8 02 00 00 00
>movl $0x2,%eax
>
>Now, the trace seems self-explanatory. I need to brush up on my ASM
>(more like relearn it it seems) but for the most part I understand the
>code part as well. Is the trace and code sections supposed to be read
>from bottom to top or top to bottom? Gdb backtrace is bottom to top, but
>the code section looks like top to bottom.

ksymoops reads the stack trace and code fragment straight from input,
in the order they are printed. So it really depends on
arch/xxx/kernel/traps.c. For ix86, stack is latest first, code is
printed from the EIP onwards. Just to confuse the issue, some other
architectures print code either side of EIP, some even print code bytes
in words which have to be byte swapped to make any sense (see the -c
option in ksymoops/README).

>I'm still not clear on what the numbers next to the function name are, I
>assumed they were line numbers... but I checked the source and they
>don't seem to corrispond.

First figure is offset within proc, second is length of proc. By
default they are in hex but, because both gdb and klogd use decimal,
ksymoops has a -x toggle to switch between hex and decimal
offsets/lengths.

>Typically, given the above kernel oops, how would one go about finding
>the bug. Open up each function in the trace and see if there are any
>blatently obvious problems and if not stick a printk() in it and wait
>for an oops?

Deduction. From the Oops decide which register is at fault. Use gdb
to dissassemble the proc, see how that register got loaded. Work out
which structure or passed variable is incorrect. Then try to find out
how the bad value was created in the first place. The last step is the
hardest.

>Are there any good books on debugging the Linux kernel. With an average
>program it's easy, wait for a seg fault or insert some breakpoints
>something is goign wrong, and run the program then get a backtrace with
>complete pointers to the source including line numbers and function
>arguments.

You can get something like that with the Integrated Kernel Debugging
patch, ftp://e-mind.com/pub/linux/patch-ikd-arca, current is
patch-ikd-arca-2.2.5.bz2.

>Do any other Unices have any more advanced crash analysis techniques?
>I've read something about a kernel core file for some OS's.

The question of taking a kernel core dump has been been raised before
on Linux. The big stumbling block has always been "can we trust the
I/O layer after an Oops?". If the fault affects disk I/O then writing
to a kernel core file could easily corrupt your entire disk. AFAIK,
nobody has though the benefit worth the risk.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

Next message: Mike Galbraith: "Re: [real fix] Re: [patch] fix for cc1 Out of memory [Re:"
Previous message: Linus Torvalds: "Re: [real fix] Re: [patch] fix for cc1 Out of memory [Re: [TESTCASE] `Out of memory for cc1']"
Maybe in reply to: Jordan Mendelson: "How To Debug Kernel Oops's"