Re: [patch] perf wrong enabled time after attach/enable/enable

From: Peter Zijlstra
Date: Wed Aug 02 2017 - 13:11:05 EST


On Tue, Jul 25, 2017 at 09:47:52PM -0400, Vince Weaver wrote:

> The problem happens with the following:
> fork a child, put it to sleep with SIGSTOP
> perf_event_open(), attaching to the pid of the child
> ENABLE/DISABLE/ENABLE/DISABLE/ENABLE
> read the results with the enabled/running multiplexing fields
>
> Your "enabled" time will be a huge negative number, which will confuse any
> code depending on it (especially the rdpmc scaling code).
>
> From what I can tell, when you DISABLE an event that's in a
> PERF_EVENT_STATE_INACTIVE state, currently it exits early without
> updating event->tstamp_stopped. So later when calculating the
> enabled time it is wrong and you get the negative number.
> The following quick hack fixes things, it just abuses the existing code
> that was there to handle a similar case when a filter failed. The proper
> fix might be to just put the event->tstamp_stopped update before the
> early out code, but the commit log for the change that introduced that
> makes it sound a bit scary.

fa66f07aa1f0 ("perf_events: Fix time tracking for events with pid != -1 and cpu != -1")

is the commit that introduced it (the scary one just moved the && to the
right place).

> Anyway, a reproducible test case for this is in my perf_event_tests tree,
> the "tests/rdpmc/rdpmc_attach_multi_enable" test.
>
> Vince
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 426c2ffba16d..18eadc3b37d5 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1812,8 +1812,7 @@ event_sched_out(struct perf_event *event,
> * maintained, otherwise bogus information is return
> * via read() for time_enabled, time_running:
> */
> - if (event->state == PERF_EVENT_STATE_INACTIVE &&
> - !event_filter_match(event)) {
> + if (event->state == PERF_EVENT_STATE_INACTIVE) {
> delta = tstamp - event->tstamp_stopped;
> event->tstamp_running += delta;
> event->tstamp_stopped = tstamp;

It doesn't look entirely crazy.. just trying to understand why Stephane
added that filter clause.

Argh, that time accounting always hurts my brain, it's so weird. I never
dared rewrite it for risk of breaking it all again, but maybe, just
maybe we should have.. :/


init(now):
enabled = stopped = running = now;

in(now):
running += now - stopped;

out(now):
stopped = now;

read:
*enabled = stopped - enabled;
*running = stopped - running;
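
The four rules above can be checked with a tiny standalone C model
(the names loosely mirror the perf tstamp_* fields, but this is a
sketch, not kernel code):

```c
#include <stdint.h>

/* Standalone model of the accounting rules above; not kernel code. */
struct acct {
	uint64_t enabled;	/* timestamp of init */
	uint64_t stopped;	/* timestamp of last out */
	uint64_t running;	/* reference point for running time */
};

static void acct_init(struct acct *a, uint64_t now)
{
	a->enabled = a->stopped = a->running = now;
}

static void acct_in(struct acct *a, uint64_t now)
{
	a->running += now - a->stopped;
}

static void acct_out(struct acct *a, uint64_t now)
{
	a->stopped = now;
}

static void acct_read(const struct acct *a, uint64_t *enabled, uint64_t *running)
{
	*enabled = a->stopped - a->enabled;
	*running = a->stopped - a->running;
}
```

Replaying the simple example that follows against this model
reproduces the same numbers.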


Simple example:

init(10):
e = s = r = 10

in(11):
r += 11 - 10; r=11

out(12):
s = 12

read:
*e = 12-10 = 2
*r = 12-11 = 1

in(14):
r += 14 - 12; r=13

out(16):
s = 16

read:
*e = 16 - 10 = 6
*r = 16 - 13 = 3


which confusingly works :-)

So let's break that by having out not update stopped; then we get:


init(10):
e = s = r = 10

in(11):
r += 11 - 10; r=11

read:
*e = 10-10 = 0
*r = 10-11 = -1

in(14):
r += 14 - 10; r=15

read:
*e = 10 - 10 = 0
*r = 10 - 15 = -5
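
The same breakage is mechanical to reproduce in a standalone sketch:
with out() skipping the stopped update, stopped stays at the init
value and read's subtraction goes backwards -- in the kernel's u64
arithmetic that wraps, which is where the huge "negative" numbers in
the read() output come from:

```c
#include <stdint.h>

/* Replay of the broken sequence above: out() no longer updates 'stopped'. */
struct acct { uint64_t enabled, stopped, running; };

static int64_t read_enabled(const struct acct *a)
{
	return (int64_t)(a->stopped - a->enabled);
}

static int64_t read_running(const struct acct *a)
{
	return (int64_t)(a->stopped - a->running);
}

static struct acct broken_replay(void)
{
	struct acct a = { .enabled = 10, .stopped = 10, .running = 10 }; /* init(10) */

	a.running += 11 - a.stopped;	/* in(11): r = 11 */
	/* out(12): stopped update skipped */
	a.running += 14 - a.stopped;	/* in(14): r += 14 - 10, so r = 15 */
	/* out(16): stopped update skipped */
	return a;
}
```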


That is negative for 'running', not enabled. But your test really does
report a negative 'enabled'.

Playing with that test it really is the IOC_DISABLE while STOP'ed that
messes things up.

Ah.. magic.. the below seems to fix things, hopefully it doesn't break
anything else.

---
kernel/events/core.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 426c2ffba16d..5dbbd1f90b4f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2458,6 +2458,20 @@ perf_install_in_context(struct perf_event_context *ctx,
}

/*
+ * reverse update_event_times()
+ */
+void __perf_event_enable_time(struct perf_event *event, u64 tstamp)
+{
+ /*
+ * total_time_enabled = tstamp_stopped - tstamp_enabled
+ * total_time_running = tstamp_stopped - tstamp_running;
+ */
+ event->tstamp_stopped = tstamp;
+ event->tstamp_enabled = tstamp - event->total_time_enabled;
+ event->tstamp_running = tstamp - event->total_time_running;
+}
+
+/*
* Put a event into inactive state and update time fields.
* Enabling the leader of a group effectively enables all
* the group members that aren't explicitly disabled, so we
@@ -2471,10 +2485,11 @@ static void __perf_event_mark_enabled(struct perf_event *event)
u64 tstamp = perf_event_time(event);

event->state = PERF_EVENT_STATE_INACTIVE;
- event->tstamp_enabled = tstamp - event->total_time_enabled;
+ __perf_event_enable_time(event, tstamp);
+
list_for_each_entry(sub, &event->sibling_list, group_entry) {
if (sub->state >= PERF_EVENT_STATE_INACTIVE)
- sub->tstamp_enabled = tstamp - sub->total_time_enabled;
+ __perf_event_enable_time(sub, tstamp);
}
}
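
FWIW, the "reverse" relationship the new helper relies on can be
checked standalone: re-deriving the tstamp_* fields from the
accumulated totals means the update_event_times() direction, run right
after re-enable, reports exactly those totals again. A sketch with
mirrored field names (not the kernel structures):

```c
#include <stdint.h>

/* Standalone mirror of the fields __perf_event_enable_time() touches. */
struct evt {
	uint64_t tstamp_enabled;
	uint64_t tstamp_stopped;
	uint64_t tstamp_running;
	uint64_t total_time_enabled;
	uint64_t total_time_running;
};

/* Same arithmetic as the patch above. */
static void enable_time(struct evt *e, uint64_t tstamp)
{
	e->tstamp_stopped = tstamp;
	e->tstamp_enabled = tstamp - e->total_time_enabled;
	e->tstamp_running = tstamp - e->total_time_running;
}

/* The update_event_times() direction: totals from the tstamps. */
static void update_times(struct evt *e)
{
	e->total_time_enabled = e->tstamp_stopped - e->tstamp_enabled;
	e->total_time_running = e->tstamp_stopped - e->tstamp_running;
}
```

Calling enable_time() and then update_times() at the same tstamp
leaves the totals unchanged, so a stale tstamp_stopped left behind by
the early-out DISABLE can no longer poison the next read.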