Re: Top 10 kernel oopses for the week ending January 5th, 2008

From: Randy Dunlap
Date: Tue Jan 08 2008 - 11:15:05 EST


On Mon, 7 Jan 2008 19:26:12 -0800 (PST) Linus Torvalds wrote:

> On Mon, 7 Jan 2008, Kevin Winchester wrote:
>
> > J. Bruce Fields wrote:
> > >
> > > Is there any good basic documentation on this to point people at?
> >
> > I would second this question. I see people "decode" oopses on lkml often
> > enough, but I've never been entirely sure how it's done. Is it somewhere
> > in Documentation?
>
> It's actually not necessarily all that trivial, unless you have a deep
> understanding of the code generated for the architecture in question (and
> even then, some oopses take more time to figure out than others, thanks
> to inlining and tailcalls etc).
>
> If the oops happened with a kernel you generated yourself, it's usually
> rather easy. Especially if you said "y" to the "generate debugging info"
> question at configuration time. Because, in that case, you really just do
> a simple
>
> gdb vmlinux
>
> and then you can do (for example) something like setting a breakpoint at
> the EIP that was reported for the oops, and it will tell you what line it
> came from.
>
> However, if you don't have the exact binary - which is the common case for
> random oopses reported on lkml - you will generally have to disassemble
> the hex sequence given in the oops (the "Code:" line), and try to match it
> up against the source code to try to figure out what is going on.
>
> Even just the disassembly is not entirely trivial, since the oops will
> give you the EIP that it happened at, but you often want to also
> disassemble *backwards* in order to get more of a context (the "Code:"
> line will mark the particular EIP that starts the oopsing instruction by
> enclosing it in <xx>), but with non-constant instruction lengths, you need
> to use a bit of trial-and-error to figure it out.
>
> I usually just compile a small program like
>
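> #include <stdio.h>
>
> /* Replace the \xnn placeholders with the bytes from the "Code:" line. */
> const char array[]="\xnn\xnn\xnn...";
>
> int main(int argc, char **argv)
> {
>     printf("%p\n", (const void *)array);  /* address to disassemble around */
>     *(volatile int *)0=0;                 /* deliberate NULL dereference */
>     return 0;
> }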
>
> and run it under gdb, and then when it gets the SIGSEGV (due to the
> obvious NULL pointer dereference), I can just ask gdb to disassemble
> around the array that contains the code[] stuff. Try a few offsets, to see
> when the disassembly makes sense (and gives the reported EIP as the
> beginning of one of the disassembled instructions).
>
> (You can do it other and smarter ways too, I'm not claiming that's a
> particularly good way to do it, and the old "ksymoops" program used to do
> a pretty good job of this, but I'm used to that particular idiotic way
> myself, since it's how I've basically always done it)

One other way to do it (at least for x86-32/64) is to use
$kerneltree/scripts/decodecode. It may work on other $arches also,
but I haven't tested it on others.
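
For anyone going the manual route described above, the tedious first
step is turning the "Code:" line into bytes and noting which one is
marked. Below is a minimal sketch of that step in C, assuming the usual
x86 format of space-separated hex bytes with the first byte of the
faulting instruction wrapped in <>. The function names parse_code_line
and format_c_array are made up for illustration, and this is not taken
from scripts/decodecode.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the hex bytes of an oops "Code:" line into buf (at most max bytes).
 * The byte wrapped in <..> starts the faulting instruction; its index is
 * stored in *fault, or -1 if no byte is marked.
 * Returns the number of bytes parsed.
 */
size_t parse_code_line(const char *line, unsigned char *buf,
                       size_t max, long *fault)
{
    size_t n = 0;
    const char *p = line;

    *fault = -1;
    if (strncmp(p, "Code:", 5) == 0)
        p += 5;

    while (*p && n < max) {
        char *end;
        unsigned long v;
        int marked = 0;

        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        if (*p == '<') {        /* marker for the faulting byte */
            marked = 1;
            p++;
        }
        v = strtoul(p, &end, 16);
        if (end == p)
            break;              /* no hex digits left: stop */
        if (marked)
            *fault = (long)n;
        buf[n++] = (unsigned char)v;
        p = end;
        if (*p == '>')
            p++;
    }
    return n;
}

/*
 * Format n bytes as "\xNN\xNN..." escapes into out, ready to paste into
 * a C string literal. Stops early if out (outlen bytes, at least 1)
 * would overflow.
 */
void format_c_array(const unsigned char *buf, size_t n,
                    char *out, size_t outlen)
{
    size_t i, used = 0;

    out[0] = '\0';
    for (i = 0; i < n && used + 5 <= outlen; i++)
        used += (size_t)snprintf(out + used, outlen - used,
                                 "\\x%02x", buf[i]);
}
```

Feeding it a line such as "Code: 55 89 e5 <8b> 00 5d c3" (made-up
bytes, not from a real oops) gives seven bytes with the marked one at
offset 3. The formatted string can be pasted into the array initializer
of the small test program quoted above, and the offset tells you where
in the array the reported EIP sits.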

---
~Randy