[PATCH] perf/x86: fix event counter update issue

From: kan . liang
Date: Mon Nov 28 2016 - 14:27:25 EST


From: Kan Liang <kan.liang@xxxxxxxxx>

Fixes: ec3232bdf851 ("perf_counter: x86: More accurate counter update")

A bug can be easily reproduced by perf stat loop program on KNL/SLM.
The loop program is just "for (i=0;i<loops;i++);"
"loops" is set to different value during the test.
Test 1:
perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 60000000000
451739576564,,cycles,313694125185,100.00
Test 2:
perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 100000000000
18446743727079256064,,cycles,313694125185,100.00
Test 3:
perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 200000000000
406240041973,,cycles,313694125185,100.00
Only test 1 has correct result.
The result for test 2 is too big. Test 3 is too small.

The bug happens more frequently on KNL/SLM due to narrower counters,
but should be generic issue on all platforms.

The issues are caused by mistakenly calculating the counter
increment (delta) in x86_perf_event_update for some cases.

Here, all the possible failure cases are listed.
Terms:
- new: current PMU counter value which read from rdpmcl.
- prev: previous counter value which is stored in &hwc->prev_count.
- in PMI/not in PMI: the event update happens in PMI handler or not.
Current code to calculate delta:
delta = (new << shift) - (prev << shift);
delta >>= shift;

Case A: Not in PMI. new > prev. But delta is negative.
That's the failure case of Test 2.
delta is s64 type. new and prev are u64 type. If the new is big
enough, after doing left shift and sub, the bit 64 of the result may
still be 1.
After converting to s64, the sign flag will be set. Since delta is
s64 type, arithmetic right shift is applied, which copy the sign flag
into empty bit positions on the upper end of delta.
It can be fixed by adding the max count value.

Here is the real data for test2 on KNL.
new = aea96e1927
prev = ffffff0000000001
delta = aea96e1927000000 - 1000000 = aea96e1926000000
aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta

Case B: In PMI. new > prev. delta is positive.
That's the failure case of Test 3.
The PMI is triggered by overflow. But there is latency between
overflow and PMI handler. So new has small amount.
Current calculation lose the max count value.

Case C: In PMI. new < prev. delta is negative.
The PMU counter may be start from a big value. E.g. the fixed period
is small.
It can be fixed by adding the max count value.

There is still a case which delta could be wrong.
The case is that event update just happens in PMI and overflow gap. It's
not in PMI, new > prev, and delta is positive. Current calculation may
lose the max count value.
But this case rarely happens. So the rare case doesn't handle here.

Reported-by: Lukasz Odzioba <lukasz.odzioba@xxxxxxxxx>
Signed-off-by: Kan Liang <kan.liang@xxxxxxxxx>
Tested-by: Lukasz Odzioba <lukasz.odzioba@xxxxxxxxx>
---
arch/x86/events/core.c | 23 +++++++++++++++++++----
arch/x86/events/intel/core.c | 4 ++--
arch/x86/events/intel/p4.c | 2 +-
arch/x86/events/perf_event.h | 2 +-
4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6c3b0ef..c351ac4 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
* Can only be executed on the CPU where the event is active.
* Returns the delta events processed.
*/
-u64 x86_perf_event_update(struct perf_event *event)
+u64 x86_perf_event_update(struct perf_event *event, bool pmi)
{
struct hw_perf_event *hwc = &event->hw;
int shift = 64 - x86_pmu.cntval_bits;
@@ -100,6 +100,21 @@ u64 x86_perf_event_update(struct perf_event *event)
delta = (new_raw_count << shift) - (prev_raw_count << shift);
delta >>= shift;

+ /*
+ * Correct delta for special cases caused by the late PMI handle.
+ * Case1: PMU counter may be start from a big value.
+ * In PMI, new < prev. delta is negative.
+ * Case2: new is big enough which impact the sign flag.
+ * The delta will be negative after arithmetic right shift.
+ * Case3: In PMI, new > prev.
+ * The new - prev lose the max count value.
+ *
+ * There may be event update in PMI and overflow gap,
+ * but it rarely happen. The rare case doesn't handle here.
+ */
+ if (((delta > 0) && pmi) || (delta < 0))
+ delta += x86_pmu.cntval_mask + 1;
+
local64_add(delta, &event->count);
local64_sub(delta, &hwc->period_left);

@@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
* Drain the remaining delta count out of a event
* that we are disabling:
*/
- x86_perf_event_update(event);
+ x86_perf_event_update(event, (flags == 0));
hwc->state |= PERF_HES_UPTODATE;
}
}
@@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)

event = cpuc->events[idx];

- val = x86_perf_event_update(event);
+ val = x86_perf_event_update(event, true);
if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
continue;

@@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);

static inline void x86_pmu_read(struct perf_event *event)
{
- x86_perf_event_update(event);
+ x86_perf_event_update(event, false);
}

/*
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a74a2db..69d65e6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
for (i = 0; i < 4; i++) {
event = cpuc->events[i];
if (event)
- x86_perf_event_update(event);
+ x86_perf_event_update(event, false);
}

for (i = 0; i < 4; i++) {
@@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct perf_event *event)
*/
int intel_pmu_save_and_restart(struct perf_event *event)
{
- x86_perf_event_update(event);
+ x86_perf_event_update(event, true);
/*
* For a checkpointed counter always reset back to 0. This
* avoids a situation where the counter overflows, aborts the
diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
index eb05335..8117de8 100644
--- a/arch/x86/events/intel/p4.c
+++ b/arch/x86/events/intel/p4.c
@@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
/* it might be unflagged overflow */
overflow = p4_pmu_clear_cccr_ovf(hwc);

- val = x86_perf_event_update(event);
+ val = x86_perf_event_update(event, true);
if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
continue;

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index c6b25ac..09c9db2 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
[PERF_COUNT_HW_CACHE_OP_MAX]
[PERF_COUNT_HW_CACHE_RESULT_MAX];

-u64 x86_perf_event_update(struct perf_event *event);
+u64 x86_perf_event_update(struct perf_event *event, bool pmi);

static inline unsigned int x86_pmu_config_addr(int index)
{
--
2.4.3