Re: [PATCH RFC] NMI Re-introduce un[set]_nmi_callback

From: Vivek Goyal
Date: Thu Sep 04 2008 - 15:09:01 EST


On Thu, Sep 04, 2008 at 02:26:37PM -0400, Don Zickus wrote:
> On Thu, Sep 04, 2008 at 07:52:31PM +0200, Andi Kleen wrote:
> > On Thu, Sep 04, 2008 at 01:20:52PM -0400, Don Zickus wrote:
> > > On Thu, Sep 04, 2008 at 05:52:17PM +0200, Andi Kleen wrote:
> > > > Then if there's a chipset specific NMI driver it could
> > > > also check if the chipset raised it. That would be a possible
> > > > solution for HP -- they would need to implement such a driver
> > > > for their systems with the special watchdog.
> > >
> > > The thing with HP's special watchdog timer is that it does _not_ have a
> > > chipset specific NMI it is trying to catch. HP is going on the assumption
> > > that _all_ NMIs are /bad/ and they want to catch _every_ NMI, log it, and
> > > reboot the system.
> >
> > That's my point. If you have drivers which can identify all other
> > NMIs then the left over NMIs must come from that watchdog driver.
> > So they just need drivers which can do that for their chipsets.
>
> Except their chipsets are _not_ producing NMIs. They just want to
> supersede all the other NMI handlers. For example, if an EDAC NMI came in,
> they don't want the EDAC handler to try and recover from it, HP just wants
> their NMI watchdog to grab the NMI, log it and reboot.
>
> >
> > It's not race free, but that's simply not possible with the x86
> > NMI architecture.
>
> I agree.
>
> >
> > Better would be probably to just configure the watchdog
> > to reboot the system directly on its own. Most other watchdogs
> > I'm aware of do that. That's more reliable anyways because the system
> > might be wedged enough to not be able to process NMIs anymore.
>
> The trick is they want to log it in a special way (BIOS or NVRAM or
> something I forget) before rebooting.
>
> >
> > >
> > > Now obviously NMIs from kgdb and oprofile are not the ones a system should
> > > panic on but this breaks HP's assumptions.
> > >
> > > So that is part of the problem. How do you become a catch-all for NMIs in
> > > a system, to process as you wish, but ignore all the 'safe' NMIs?
> >
> > To be fully reliable: you need a new NMI architecture or move the event
> > somewhere else.
> > To be reasonably reliable (assuming NMIs are not very frequent): you
> > need drivers for all NMI sources that can identify them.
>
> Yeah I know. Originally I thought this would be easy, just replace the
> default handler. But once the mention of kgdb and oprofile using the NMIs
> came up, I realized we are almost back to square one. :-(
>

Add "kdump" to the list. It will also be broken if we decide to let one
driver hijack the NMI handler.
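
To make the difference concrete, here is a rough user-space model of the
two approaches discussed above. It is only a sketch, not kernel code:
every name in it (nmi_source, dispatch_nmi, dispatch_hijacked, the
*_claims helpers) is made up for illustration and does not correspond to
an existing kernel interface. It assumes each NMI user can tell whether
an NMI is its own, which on real hardware is only approximately true.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical NMI sources, for illustration only. */
enum nmi_source {
	NMI_SRC_KGDB,
	NMI_SRC_OPROFILE,
	NMI_SRC_KDUMP,
	NMI_SRC_UNKNOWN,	/* nobody can identify this one */
};

enum nmi_result {
	NMI_HANDLED,		/* a known user claimed the NMI */
	NMI_CATCH_ALL,		/* fell through to the watchdog: log and reboot */
};

/* Each handler returns true if it recognises the NMI as its own. */
typedef bool (*nmi_claim_fn)(enum nmi_source src);

static bool kgdb_claims(enum nmi_source src)     { return src == NMI_SRC_KGDB; }
static bool oprofile_claims(enum nmi_source src) { return src == NMI_SRC_OPROFILE; }
static bool kdump_claims(enum nmi_source src)    { return src == NMI_SRC_KDUMP; }

static const nmi_claim_fn known_users[] = {
	kgdb_claims,
	oprofile_claims,
	kdump_claims,
};

/*
 * Chain model: ask every known user first; only the leftover NMIs
 * reach the catch-all watchdog.
 */
static enum nmi_result dispatch_nmi(enum nmi_source src)
{
	size_t i;

	for (i = 0; i < sizeof(known_users) / sizeof(known_users[0]); i++) {
		if (known_users[i](src))
			return NMI_HANDLED;
	}
	return NMI_CATCH_ALL;
}

/*
 * Hijack model: a single callback replaces the whole chain, so every
 * NMI, including the ones kgdb, oprofile and kdump rely on, ends up in
 * the watchdog.
 */
static enum nmi_result dispatch_hijacked(enum nmi_source src)
{
	(void)src;	/* the source is never consulted */
	return NMI_CATCH_ALL;
}
```

In this model dispatch_nmi(NMI_SRC_KDUMP) is claimed by kdump and only
NMI_SRC_UNKNOWN reaches the watchdog, whereas dispatch_hijacked() sends
the kdump NMI to the watchdog as well, which is the breakage I mean.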

Thanks
Vivek