[PATCH] perf/core: set cgroup for cpu contexts for new cgroup events

From: David Carrillo-Cisneros
Date: Mon Aug 01 2016 - 23:05:44 EST


There is an optimization in perf_cgroup_sched_{in,out} that skips the
switch of cgroup events if the old and new cgroups of a task context
switch are the same (see the sketch after the list below). This
optimization interacts with the current code in two ways that can leave
a cpu context's cgroup pointer (cpuctx->cgrp) NULL even though the
context holds a cgroup event that matches the current task. These are:

1. On creation of the first cgroup event on a CPU: In the current code,
cpuctx->cgrp is only set in perf_cgroup_sched_in, but, due to the
aforesaid optimization, perf_cgroup_sched_in will not run until the next
cgroup switch on that cpu. This may happen late or never, depending on
the system's number of cgroups, cpu load, etc.

2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
cpuctx->cgrp is set to NULL. Any new cgroup event will not be scheduled
in, because cpuctx->cgrp == NULL, until a cgroup switch occurs and
perf_cgroup_sched_in is executed (updating cpuctx->cgrp).
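
For reference, the skip happens in a check roughly like the following
(a simplified sketch of the v4.7-era perf_cgroup_sched_in, shown only to
illustrate the problem; it is not part of this patch):

	static inline void perf_cgroup_sched_in(struct task_struct *prev,
						struct task_struct *task)
	{
		struct perf_cgroup *cgrp1;
		struct perf_cgroup *cgrp2;

		cgrp1 = perf_cgroup_from_task(task, NULL);
		cgrp2 = perf_cgroup_from_task(prev, NULL);

		/*
		 * Only schedule in cgroup events when the cgroup changes
		 * across the context switch; otherwise the whole switch,
		 * and hence the update of cpuctx->cgrp, is skipped.
		 */
		if (cgrp1 != cgrp2)
			perf_cgroup_switch(task, PERF_CGROUP_SWIN);
	}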

This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
mirroring what list_del_event does when removing a cgroup event from the
cpu context, as introduced in:
commit 68cacd29167b ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")

With this patch, cpuctx->cgrp is always set when the first cgroup event
is installed in the cpu context and cleared when the last one is removed
from it. Having cpuctx->cgrp correctly set from the moment the event is
installed in the context allows event_filter_match to work correctly
while scheduling events in and out, without relying on a cgroup switch
that may occur late (if ever).
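
For context, event_filter_match ends up in perf_cgroup_match, which does
roughly the following (paraphrased from the v4.7-era code, not part of
this patch); with cpuctx->cgrp == NULL every cgroup event fails to match
and is therefore never scheduled in:

	static inline int perf_cgroup_match(struct perf_event *event)
	{
		struct perf_event_context *ctx = event->ctx;
		struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);

		/* @event doesn't care about cgroups */
		if (!event->cgrp)
			return 1;

		/*
		 * The event wants a specific cgroup but the cpu context
		 * is not associated with any: no match. This is where
		 * both problems described above end up.
		 */
		if (!cpuctx->cgrp)
			return 0;

		/*
		 * Cgroup scoping is recursive: match if cpuctx's cgroup
		 * is the event's cgroup or one of its descendants.
		 */
		return cgroup_is_descendant(cpuctx->cgrp->css.cgroup,
					    event->cgrp->css.cgroup);
	}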

The problem is easy to observe on a machine with only one cgroup:

$ perf stat -e cycles -I 1000 -C 0 -G /
#           time             counts unit events
     1.000161699      <not counted>      cycles                    /
     2.000355591      <not counted>      cycles                    /
     3.000565154      <not counted>      cycles                    /
     4.000951350      <not counted>      cycles                    /

After the fix, the output is as expected:

$ perf stat -e cycles -I 1000 -a -G /
#           time             counts unit events
     1.004699159          627342882      cycles                    /
     2.007397156          615272690      cycles                    /
     3.010019057          616726074      cycles                    /
This patch also makes a minor style edit in list_del_event.

Rebased onto peterz/queue/perf/core.

Signed-off-by: David Carrillo-Cisneros <davidcc@xxxxxxxxxx>
Reviewed-by: Stephane Eranian <eranian@xxxxxxxxxx>
---
kernel/events/core.c | 32 +++++++++++++++++++++++---------
1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9345028..1efa89d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1392,6 +1392,8 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
static void
list_add_event(struct perf_event *event, struct perf_event_context *ctx)
{
+ struct perf_cpu_context *cpuctx;
+
lockdep_assert_held(&ctx->lock);

WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
@@ -1412,8 +1414,20 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
list_add_tail(&event->group_entry, list);
}

- if (is_cgroup_event(event))
- ctx->nr_cgroups++;
+ if (is_cgroup_event(event)) {
+ /*
+ * If this is the first cgroup event, set cgrp in the cpu
+ * context so event_filter_match works.
+ */
+ if (!ctx->nr_cgroups++) {
+ /*
+ * Because cgroup events are always per-cpu events,
+ * this will always be called from the right CPU.
+ */
+ cpuctx = __get_cpu_context(ctx);
+ cpuctx->cgrp = event->cgrp;
+ }
+ }

list_add_rcu(&event->event_entry, &ctx->event_list);
ctx->nr_events++;
@@ -1595,18 +1610,18 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
event->attach_state &= ~PERF_ATTACH_CONTEXT;

if (is_cgroup_event(event)) {
- ctx->nr_cgroups--;
- /*
- * Because cgroup events are always per-cpu events, this will
- * always be called from the right CPU.
- */
- cpuctx = __get_cpu_context(ctx);
/*
* If there are no more cgroup events then clear cgrp to avoid
* stale pointer in update_cgrp_time_from_cpuctx().
*/
- if (!ctx->nr_cgroups)
+ if (!--ctx->nr_cgroups) {
+ /*
+ * Because cgroup events are always per-cpu events,
+ * this will always be called from the right CPU.
+ */
+ cpuctx = __get_cpu_context(ctx);
cpuctx->cgrp = NULL;
+ }
}

ctx->nr_events--;
--
2.8.0.rc3.226.g39d4020