Re: [PATCH RFC] NMI Re-introduce un[set]_nmi_callback

From: Ingo Molnar
Date: Fri Sep 05 2008 - 06:25:26 EST



* Don Zickus <dzickus@xxxxxxxxxx> wrote:

> Prarit's patch disabled the timer upon registering a callback to
> prevent this case. The thought was if you have your own handler you
> could provide your own watchdog.

btw., while we are talking about watchdog design problems, the current
way the NMI timer watchdog oprofile fallback uses the die handler and
how it all interacts with the ioapic/lapic-timer NMI watchdog is
misdesigned as well. (this is used in the relatively rare but still
possible case where no primary oprofile handler is found and installed)

The i386 and x86_64 architecture code started using die notifiers for
the NMI timer watchdog oprofile code two years ago, via commit 2fbe7b25
"i386/x86-64: Remove un/set_nmi_callback and reserve/release_lapic_nmi
functions". [ Hey - what a coincidence - that was done by you! ;-) ]

It was a nice cleanup but not complete: using a die notifier for the NMI
watchdog does not work properly as the NMI watchdog driver has no
knowledge currently about whether it got activated (often it's not even
possible to tell it):

+ case DIE_NMI:
+ oprofile_add_sample(args->regs, 0);
+ ret = NOTIFY_STOP;
+ break;

it always returns NOTIFY_STOP. That breaks all other lower prio
notifiers down the chain which might have proper knowledge about the
source of the NMI. Such kind of cross-notifier breakage is one of the
reasons why driver notifiers try to become exclusive and try to play
tricks with the nmi watchdog.

Instead the full solution would be for it to return NOTIFY_OK when it is
not the source of the NMI (and is able to tell it).

[
The reason why i qualified my statement with "when it is able to tell
it" is that while it is possible to disambiguate the NMI source when
the source of the NMI was the local apic timer (we already do that in
lapic_wd_event() - we return 1 if we rearmed the counter in the CPU),
it is not possible to tell it reliably that the NMI source was the
IO-APIC.

The reason for that is that the IO-APIC generated NMIs are
fundamentally 'anonymous' in that case: we put the legacy PIC into
auto-EOI mode and the IO-APIC broadcasts the NMIs and they are edge
triggered (on all but some really ancient systems) . So there's no
permanent and reliable information left about why the NMI was raised
in the CPU. This is a property of the hardware, we cannot solve it on
the kernel side. (we tried explicit ACKs before, and those are even
worse.)
]

So the right long-term solution is to move all NMI watchdog code
(including the lockup detection bits - nmi_watchdog_tick(), etc.) into
the notifier, then to propagate the re-arm information from
lapic_wd_event() back into the notifier return code, and perhaps to
print a (one-time) warning into the kernel log about the (in essence)
deactivated lower-prio notifiers when the IO-APIC watchdog is activated.

This is all non-default debug or instrumentation functionality so if
they cancel out each other in certain rare cases that's not a big issue
in practice, as long as the user is informed about the constraints.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/