Re: [PATCH 4/4] [x86] perf: fix accidentally ack'ing a second event on intel perf counter

From: Stephane Eranian
Date: Thu Sep 02 2010 - 04:13:28 EST


Robert,

Do you have the test program you used to test this?
I believe the NHM hack does not solve the problem; it
just makes it harder to trigger.

I suspect the real issue is that the GLOBAL_STATUS
bitmask cannot be trusted. I'd like to verify this.

Has the problem appeared only on Nehalem, or also on
Westmere?


On Wed, Sep 1, 2010 at 4:57 PM, Robert Richter <robert.richter@xxxxxxx> wrote:
> On 01.09.10 09:04:45, Stephane Eranian wrote:
>> Don,
>>
>> Found your patch on LKML (I am not on it).
>>
>> In your changelog you said:
>>
>> > During testing of a patch to stop having the perf subsystem swallow nmis,
>> > it was uncovered that Nehalem boxes were randomly getting unknown nmis
>> > when using the perf tool.
>> >
>> > Moving the ack'ing of the PMI closer to when we get the status allows
>> > the hardware to properly re-set the PMU bit signaling another PMI was
>> > triggered during the processing of the first PMI. This allows the new
>> > logic for dealing with the shortcomings of multiple PMIs to handle the
>> > extra NMI by 'eat'ing it later.
>>
>> > Now one can wonder why are we getting a second PMI when we disable all
>> > the PMUs in the beginning of the NMI handler to prevent such a case, for
>> > that I do not know. But I know the fix below helps deal with this quirk.
>> >
>>
>> I am assuming you're talking about back-to-back NMIs here, not nested NMIs.
>> I don't quite understand the scenario here. Is it the case that you handled 1
>> overflow, and then right as you return from the interrupt, you get a second
>> PMI with an ovfl_status=0?
>>
>> What events did you measure? Which counters did you use?
>> Did you have HT turned on?
>
> It is related to this thread:
>
>  http://lkml.org/lkml/2010/8/25/124
>
> Not acking the status immediately triggered an nmi, but the status was
> 0. Acking after reading and before processing the counters results in
> a non-zero status and thus, no empty nmi.
>
> -Robert
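
[Illustration] The ack ordering Robert describes above can be sketched
as a small user-space toy model. This is only a sketch under stated
assumptions, not the kernel code and not a description of how the
hardware really behaves: struct fake_pmu, overflow(), ack(),
handle_nmi() and run_scenario() are all hypothetical names, the
"status" field merely stands in for GLOBAL_STATUS, and the model
assumes that a second overflow of the same counter arrives while the
first one is being processed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct fake_pmu {
	uint64_t status;	/* stand-in for the GLOBAL_STATUS bitmask */
	int pending_nmis;	/* PMIs latched but not yet delivered */
};

/* A counter overflows: set its status bit and latch a PMI. */
static void overflow(struct fake_pmu *p, int bit)
{
	p->status |= 1ULL << bit;
	p->pending_nmis++;
}

/* Ack: clear the given bits in the status mask. */
static void ack(struct fake_pmu *p, uint64_t mask)
{
	p->status &= ~mask;
}

static int count_bits(uint64_t v)
{
	int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * Model of one NMI. Assumption: counter 0 overflows once more while
 * the first pass is processing. Returns the number of overflows the
 * handler accounted for.
 */
static int handle_nmi(struct fake_pmu *p, bool ack_early)
{
	int handled = 0, loops = 0;
	bool injected = false;
	uint64_t status = p->status;

	p->pending_nmis--;			/* this NMI is being delivered */
	while (status && ++loops <= 100) {
		if (ack_early)
			ack(p, status);		/* ack right after reading */
		handled += count_bits(status);	/* "process" the counters */
		if (!injected) {
			overflow(p, 0);		/* second event, same counter */
			injected = true;
		}
		if (!ack_early)
			ack(p, status);		/* late ack wipes the new bit too */
		status = p->status;		/* more work to be done? */
	}
	return handled;
}

/* One initial overflow on counter 0, then a single NMI. */
int run_scenario(bool ack_early)
{
	struct fake_pmu p = { 0, 0 };

	overflow(&p, 0);
	return handle_nmi(&p, ack_early);
}
```

In this model, run_scenario(false) accounts for one overflow and
run_scenario(true) for two. In both cases one PMI is still latched
when the handler returns; with the late ack the second event has
already been wiped from the status, so that NMI has nothing to claim.
Whether the real GLOBAL_STATUS behaves like this is exactly the open
question in this thread.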
>
>>
>> > Tested on multiple Nehalems where the problem was occurring. With the
>> > patch, the code now loops a second time to handle the second PMI (whereas
>> > before it was not).
>>
>
> --
> Advanced Micro Devices, Inc.
> Operating System Research Center
>
>