Re: [RFC] x86, NMI, Treat unknown NMI as hardware error

From: Huang Ying
Date: Sun Jun 12 2011 - 21:34:52 EST


On 06/09/2011 08:09 PM, Don Zickus wrote:
> On Fri, May 20, 2011 at 04:13:25PM +0800, Huang Ying wrote:
>> Hi, Don,
>>
>> On 05/18/2011 03:07 AM, Don Zickus wrote:
>>> On Tue, May 17, 2011 at 11:18:59AM -0700, Andi Kleen wrote:
>>>>> Random thought, in the Firmware first mode of HEST (which is the only way
>>>>> GHES records get produced??), does an SCI happen first to jump into the
>>>>> firmware for processing, then an NMI?
>>>>
>>>> Either that or there is a separate service processor which handles it.
>>>> Presumably it depends a lot on the particular system.
>>>
>>> Ah interesting. I was going to suggest somehow setting a bit when an SCI
>>> comes in and check that bit in the unknown NMI path as a possible hint
>>> that the NMI might be related to HEST (sorta how we flag unknown NMIs in
>>> the perf code).
>>>
>>> It was just an idea. Obviously a service processor will make that more
>>> difficult. :-)
>>
>> Hmm, what's the conclusion? Do you think unknown NMI should be seen as
>> hardware error? At least on some white listed machines?
>
> I still sorta have the opinion that a hardware error should be
> recognizable either through a GHES record or a bit in the southbridge.
> Whereas an unknown NMI is something lost and has no owner as the result of
> either a buggy NMI handler or an unimplemented NMI handler.
>
> Yeah, I can see hardware errors coming in through an unknown NMI, but to me
> (from what I am reading about APEI/GHES) those should be trapped
> by the firmware, and if they aren't then the firmware is broken. In those
> cases it should be up to the OEM to provide proper firmware (even certify
> them) to allow the proper experience, which includes being properly
> trapped by an NMI handler.
>
> Perhaps I am a bit naive in my belief, but I am a little nervous about panicking
> all the time on unknown NMIs when we are still chasing missed perf NMIs on
> a loaded box.

I think things SHOULD work this way too. That is just not the reality.

Best Regards,
Huang Ying