KDB in the mainstream 2.4.x kernels?

From: linas@austin.ibm.com
Date: Fri Jul 18 2003 - 15:06:33 EST

Will there be a day that I can expect to find KDB in the 2.4.x kernel?
I know that a traditional answer has been 'never', but I would like the
various influencers and decision makers to reconsider ...

I agree with Linus Torvalds that debuggers are 100% useless when you
are working on code that you know intimately. I know, I've written a
lot of code, I'm proud of it, and I sneer at people who use words like
'development environment'. Crap, if you can't figure out why your
code crashed, you shouldn't be a programmer. But these days, I am not
debugging my code. I'm debugging code that I've never seen before.
And for that, I use KDB.

Right now, I work in a job where the *only* thing that I do is to analyze
and sometimes (when I'm lucky) fix kernel crashes. It's all I do.
I don't write any new code, don't do any porting at all. I also don't
debug any 2.5/2.6 'unstable' kernels, nor do I handle any new/unstable
device drivers. I focus entirely on the 2.4.x kernels, and, with a
small team here, there are more than enough kernel bugs to keep us all
completely busy. The crashes are generated by a test team of 8 people
with dozens of machines. Ostensibly their mission is to test new
hardware, but in fact, almost all the crashes that they find are kernel
bugs. The *only* thing that the test team does is to run stress tests.
Basic stuff. Kernel stress. File create/delete/copy. Reiser, jfs, ext3,
swap, OOM, scsi. Network, nfs, samba. Some tests take hours to crash
the kernel, some take days. But the kernel crashes. It's always crashing.
Corruption, races, missing locks, typos, bad hardware, you name it.

When I get it, it has a KDB prompt in front of it. KDB is great.
I can figure out where it crashed, I can look at the assembly, I can
examine memory locations. I can chase pointers by hand. And I can
do it all symbolically, with the symbol names in front of me. Now,
KDB rarely points right at the bug, but it is invaluable for figuring
out where to start looking. Sometimes I even find the bug, often
I don't. But anyway, this is all academic, because it's at work, in
a controlled environment, where I have the time and resources I need.

But the real reason I write this note is that I want to have the same
capability at home. It suddenly occurred to me that the servers I run
at home sometimes (rarely) crash with the same symptoms as those at work.
Sure, I can probably blame buggy PC hardware. But .. I dunno. I've been
consistently ignoring these crashes 'cause it's just too much of a hassle
to try to debug them. It's not worth the effort. But hey ... if I had
KDB at home... maybe it would be worth looking into the hangs. I could
see getting motivated to look into some of these. At least get some
idea of where the machine got hung. Maybe no fix, but at least
somewhere to lay the blame.

Yes, of course I could just apply the KDB patches myself, but frankly
it's a hassle. I already play the patch game and I hate it. Every new
kernel, I have to try to remember where to find patch x, how to apply
it, fix up this and that... it's just plain painful.

I know that this is not a forceful argument. But crashes are a fact of
life, whatever the reason may be. And the crashes almost always happen
in a piece of code I have *never* laid eyes on before, so it's unrealistic
to try to puzzle it out with the small dollop of info from magic SysRq.

Debuggers can be useless, or worse than useless, when you are a developer
on a piece of code you know well. But when plunging into foreign territory,
all the tools and firepower that you can muster are worth every bit.
This is why KDB belongs in the mainstream kernel distros.

