Re: [RFC PATCH] perf_counter: dynamically allocate tasks' perf_counter_context struct

From: Ingo Molnar
Date: Wed May 20 2009 - 13:12:29 EST



* Paul Mackerras <paulus@xxxxxxxxx> wrote:

> This replaces the struct perf_counter_context in the task_struct
> with a pointer to a dynamically allocated perf_counter_context
> struct. The main reason for doing this is to allow us to
> transfer a perf_counter_context from one task to another when we
> do lazy PMU switching in a later patch.

Hm, i'm not sure how far this gets us towards lazy PMU switching.

In fact i'd say that the term "lazy PMU switching" is probably
misleading; we should use "equivalent PMU context switching"
instead.

The difference is really crucial. We cannot really detach a PMU
context from a task, because the task might migrate to another CPU
and run there. Any laziness in the switching of the PMU context
would create the need to send IPIs and other overhead. For similar
reasons, lazy FPU switching methods are generally not workable on
SMP.

Instead, the right abstraction is to define 'equivalency' between
tasks' PMU contexts, created by inheritance. When two tasks that
both have the same parent counter(s) context-switch, we don't need
to do _any_ physical PMU switching. The counts (and events) from one
of the tasks can be freely transferred to the other task. They are
going to get summed up in the parent anyway, so the result is
invariant under the context-switch.

To implement this, we need something like an 'ID', cookie or
generation counter for the context, which changes to another unique
number (or pointer) the moment a context is modified: a counter is
added, removed or a counter attribute is changed. When counters are
inherited the cookie gets carried over too. The context-switch code
can then do this optimization:

	if (prev->ctx.cookie != next->ctx.cookie)
		switch_pmu_ctx(prev, next);

... which will be _very_ fast for the inherited counters (perf stat)
case.
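
To make the cookie idea concrete, here is a minimal user-space sketch,
not kernel code. The names struct pmu_ctx, struct task, next_cookie,
ctx_add_counter(), ctx_inherit() and demo_switches() are hypothetical
and invented for illustration; only the cookie comparison and the
switch_pmu_ctx() call come from the snippet above. It assumes a plain
global counter as the source of unique cookies, which a real
implementation would have to make SMP-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
struct pmu_ctx {
	uint64_t cookie;	/* identifies the current counter configuration */
	int nr_counters;
};

struct task {
	struct pmu_ctx ctx;
};

/* Assumption: a single global source of unique cookies (not SMP-safe). */
static uint64_t next_cookie = 1;
static unsigned long physical_switches;

/* Any modification of the counter set makes the context unique again. */
static void ctx_add_counter(struct pmu_ctx *ctx)
{
	assert(ctx != NULL);
	ctx->nr_counters++;
	ctx->cookie = next_cookie++;
}

/* Inheritance copies the configuration, so the cookie is carried over. */
static void ctx_inherit(struct pmu_ctx *child, const struct pmu_ctx *parent)
{
	*child = *parent;
}

/* Stand-in for physically reprogramming the PMU. */
static void switch_pmu_ctx(struct task *prev, struct task *next)
{
	(void)prev;
	(void)next;
	physical_switches++;
}

/* The optimization: skip the physical switch for equivalent contexts. */
static void context_switch(struct task *prev, struct task *next)
{
	if (prev->ctx.cookie != next->ctx.cookie)
		switch_pmu_ctx(prev, next);
}

/* Returns how many physical switches two siblings needed. */
static unsigned long demo_switches(void)
{
	struct task parent = { .ctx = { .cookie = 0, .nr_counters = 0 } };
	struct task child_a, child_b;

	physical_switches = 0;
	ctx_add_counter(&parent.ctx);		/* parent gets a fresh cookie */
	ctx_inherit(&child_a.ctx, &parent.ctx);
	ctx_inherit(&child_b.ctx, &parent.ctx);

	context_switch(&child_a, &child_b);	/* equivalent: skipped */

	ctx_add_counter(&child_b.ctx);		/* child_b diverges */
	context_switch(&child_a, &child_b);	/* not equivalent: switched */

	return physical_switches;
}
```

In this sketch the first switch between the two inherited contexts is
skipped, and only the switch after child_b adds a counter of its own
reaches switch_pmu_ctx().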

Note, this does put a few requirements on the architecture code, and
it requires a few changes to the sched-in/sched-out code and to the
handling of tasks migrating to other CPUs.

For example the x86 code currently demuxes counter events back to
counter pointers, using a per-cpu structure:

	struct cpu_hw_counters {
		struct perf_counter	*counters[X86_PMC_IDX_MAX];
		unsigned long		used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
		unsigned long		active_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
		unsigned long		interrupts;
		int			enabled;
	};

the counter pointers are per task - so this bit of cpu_hw_counters
needs to move into the ctx structure, so that if an overflow IRQ
comes in, we always only deal with local counters (not with some
previous task's counter pointers).
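
A rough user-space sketch of that split follows, as one possible shape
rather than the actual x86 change. PMC_IDX_MAX, struct pmu_ctx,
struct cpu_hw_state, cur_ctx, sched_in() and handle_overflow() are all
hypothetical names; the assumption is that the per-cpu structure keeps
only a pointer to the context of the task currently running there, and
that sched-in updates that pointer even when the physical PMU switch
is skipped.

```c
#include <assert.h>
#include <stddef.h>

#define PMC_IDX_MAX 4	/* hypothetical; stands in for X86_PMC_IDX_MAX */

/* Hypothetical, stripped-down counter: only tracks overflow events. */
struct perf_counter {
	unsigned long overflows;
};

/* The counter pointers live in the per-task context ... */
struct pmu_ctx {
	struct perf_counter *counters[PMC_IDX_MAX];
};

/* ... and the per-cpu state only remembers whose context is loaded. */
struct cpu_hw_state {
	struct pmu_ctx *cur_ctx;
	unsigned long interrupts;
};

/* Done on every sched-in, even if no physical PMU switch is needed. */
static void sched_in(struct cpu_hw_state *cpu, struct pmu_ctx *ctx)
{
	assert(cpu != NULL);
	cpu->cur_ctx = ctx;
}

/*
 * Demux a hardware counter index back to a counter pointer, using the
 * current task's context. Returns NULL if nothing is attached at idx.
 */
static struct perf_counter *handle_overflow(struct cpu_hw_state *cpu, int idx)
{
	struct perf_counter *counter;

	cpu->interrupts++;
	if (!cpu->cur_ctx || idx < 0 || idx >= PMC_IDX_MAX)
		return NULL;

	counter = cpu->cur_ctx->counters[idx];
	if (counter)
		counter->overflows++;
	return counter;
}
```

With this layout an overflow on hardware index 0 is credited to
whichever task's context was last scheduled in, never to a stale
pointer left behind by the previous task.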

Ingo