Re: [PATCH 2/5] x86: fix runtime error inarch/x86/kernel/cpu/mcheck/mce_amd_64.c

From: Ingo Molnar
Date: Mon Jan 28 2008 - 14:02:37 EST



* Greg KH <gregkh@xxxxxxx> wrote:

> > Are kobjects protected against accidental copying? If not add &kobj
> > to the 'magic value' too, and check that - it becomes
> > copying-resistent that way and has the same cost to check. (which is
> > negligible anyway)
>
> Oh, that's a very cool idea, I like it :)

hey, you are welcome :-)

[ I guess i should not mention that i've implemented list debugging for
Linux that checksums the struct list contents and stores the checksum
in it (offset by a magic value plus to address of the list head), and
thus protects it against accidental corruption? It was capable of
reliably detecting mixed up list_add() arguments for example, it
detected list corruption of _every_ sort, it detected double
list_del() and list_add() of an already active list member as well. It
was even capable of detecting SMP races: two parallel unserialized
list_del()'s on the same list head were detected and warned about as
well. I guess i should release it one of these days? =B-) ]

> So I guess the complaint is that CONFIG_KOBJECT is too verbose? I can
> understand that, I'll look to clean things up as we have a lot of
> "tracing" type functions in the debug output. Hm, I wonder if the new
> marker code could help with that...

that would certainly be helpful and nice as a second line of defense.
(as a second line of defense the DEBUG_KOBJECT stuff was pretty
effective as well, as you have shown it with this bug.)

but the "first line of defense" must be an as automatically and as
transparently fault-resilient data structure as humanly possible.

"please enable DEBUG_KOBJECT and send me the logs" reduces the active
tester base to somewhere around 0.1% of the full tester base - and if
the bug is in any way spurious, the tester base goes down to 0.001% or
down to outright zero.

i really think that in terms of central kernel data structures, no
amount of debugging infrastructure is "too much" (within taste limits) -
as long as it's cleanly structured. Even 10,000 lines of debugging code
dwarves the 9,000,000 lines of code that it ultimately covers.

> > For example locks currently have 4000 lines of code of debugging
> > infrastructure and 1500 lines of code of self-tests. The total
> > amount of core locking code is less 1000 lines of code. So for every
> > line of locking code there's more than 5 lines of debugging
> > infrastructure (!).
>
> Heh, good point, I'll look into fixing this up to be more useful to
> the "normal" developer, and not just the two of us doing kobject core
> development :)

you are right, that is another important argument: meaningful debugging
checks that catch difficult to debug problems speed up development and
allows one to iterate the codebase faster. As long as the debugging
checks are transparent (i.e. they have no outrageous false positive rate
and they do not spam the logs), distros and testers will happily enable
it and you'll get much more feedback than you ever imagined would be
possible.

Time is precious, even if you know what you are doing, so the more of
the debugging you are capable of pushing to the computer, the better.
Even if some debugging levels are as outrageously expensive as "iterate
the linear list of all kobjects in the system, for every kobject op",
people will still be happy to turn it on, if it catches bugs. Attach a
nice WARN_ON_ONCE() to it and kerneloops.org will pick it up and
categorize it for you to check. And i think that's an important effort
for core subsystem maintainers: for example i spent far more time
writing debugging code for mutexes and other locks than i spent time
writing the facilities themselves. Preventing a certain category of bugs
is far more important to users than shaving off 1 cycle from a fastpath.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/