Re: [PATCH v6] perf: Sharing PMU counters across compatible events

From: Song Liu
Date: Thu Oct 31 2019 - 12:29:55 EST




> On Oct 31, 2019, at 5:43 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Sep 18, 2019 at 10:23:14PM -0700, Song Liu wrote:
>> This patch tries to enable PMU sharing. To make perf event scheduling
>> fast, we use special data structures.
>>
>> An array of "struct perf_event_dup" is added to the perf_event_context,
>> to remember all the duplicated events under this ctx. All the events
>> under this ctx have a "dup_id" pointing to its perf_event_dup. Compatible
>> events under the same ctx share the same perf_event_dup. The following
>> figure shows a simplified version of the data structure.
>>
>>     ctx -> perf_event_dup -> master
>>                  ^
>>                  |
>>      perf_event /|
>>                  |
>>      perf_event /
>>
>> Connections between perf_event and perf_event_dup are built when events are
>> added to or removed from the ctx. So these are not on the critical path of
>> schedule or perf_rotate_context().
>>
>> On the critical paths (add, del, read), sharing PMU counters doesn't
>> increase the complexity. Helper functions event_pmu_[add|del|read]() are
>> introduced to cover these cases. All these functions have O(1) time
>> complexity.
>>
>> We allocate a separate perf_event for perf_event_dup->master. This needs
>> extra attention, because perf_event_alloc() may sleep. To allocate the
>> master event properly, a new pointer, tmp_master, is added to perf_event.
>> tmp_master carries a separate perf_event into list_[add|del]_event().
>> The master event has valid ->ctx and holds ctx->refcount.
>
> That is really nasty and expensive, it basically means every !sampling
> event carries a double allocate.
>
> Why can't we use one of the actual events as master?

I think we can use one of the events as master. We need to be careful when
the master event is removed, but it should be doable. Let me try.
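
A rough user-space sketch of the "careful when the master event is removed"
part, not the actual patch: when the master goes away, each follower first
folds in what the old master has counted, then the first follower is promoted
and the rest are repointed at it. struct ev, ev_sync() and ev_remove_master()
are hypothetical stand-ins for illustration; locking and the real
pmu->del()/pmu->add() hand-over are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct perf_event. */
struct ev {
	struct ev *dup_master;	/* NULL: not shared; self: this event is the master */
	struct ev *next;	/* toy singly linked list of events in the ctx */
	uint64_t count;		/* value accumulated by this event */
	uint64_t prev;		/* last snapshot taken of the master's count */
};

/* Fold the master's progress since the last snapshot into a follower. */
static void ev_sync(struct ev *e)
{
	uint64_t now;

	assert(e != NULL);
	if (!e->dup_master || e->dup_master == e)
		return;

	now = e->dup_master->count;
	e->count += now - e->prev;
	e->prev = now;
}

/*
 * Drop 'old' as master. The first follower found becomes the new master
 * and every other follower is re-baselined against it. Returns the new
 * master, or NULL if 'old' had no followers.
 */
static struct ev *ev_remove_master(struct ev *head, struct ev *old)
{
	struct ev *new_master = NULL;
	struct ev *e;

	assert(old != NULL);
	for (e = head; e; e = e->next) {
		if (e == old || e->dup_master != old)
			continue;

		ev_sync(e);		/* collect what the old master counted */
		if (!new_master)
			new_master = e;
		e->dup_master = new_master;
		e->prev = new_master->count;	/* new baseline */
	}

	old->dup_master = NULL;
	return new_master;
}
```

The re-baselining of prev against the new master is the step that is easy to
miss: without it a follower would add the difference between two unrelated
counters.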

>
>> +/*
>> + * Sharing PMU across compatible events
>> + *
>> + * If two perf_events in the same perf_event_context are counting same
>> + * hardware events (instructions, cycles, etc.), they could share the
>> + * hardware PMU counter.
>> + *
>> + * When a perf_event is added to the ctx (list_add_event), it is compared
>> + * against other events in the ctx. If they can share the PMU counter,
>> + * a perf_event_dup is allocated to represent the sharing.
>> + *
>> + * Each perf_event_dup has a virtual master event, which is called by
>> + * pmu->add() and pmu->del(). We cannot call perf_event_alloc() in
>> + * list_add_event(), so it is allocated and carried by event->tmp_master
>> + * into list_add_event().
>> + *
>> + * Virtual master in different cases/paths:
>> + *
>> + * < I > perf_event_open() -> close() path:
>> + *
>> + * 1. Allocated by perf_event_alloc() in sys_perf_event_open();
>> + * 2. event->tmp_master->ctx assigned in perf_install_in_context();
>> + * 3.a. if used by ctx->dup_events, freed in perf_event_release_kernel();
>> + * 3.b. if not used by ctx->dup_events, freed in perf_event_open().
>> + *
>> + * < II > inherit_event() path:
>> + *
>> + * 1. Allocated by perf_event_alloc() in inherit_event();
>> + * 2. tmp_master->ctx assigned in inherit_event();
>> + * 3.a. if used by ctx->dup_events, freed in perf_event_release_kernel();
>> + * 3.b. if not used by ctx->dup_events, freed in inherit_event().
>> + *
>> + * < III > perf_pmu_migrate_context() path:
>> + * all dup_events removed during migration (no sharing after the move).
>> + *
>> + * < IV > perf_event_create_kernel_counter() path:
>> + * not supported yet.
>> + */
>> +struct perf_event_dup {
>> +	/*
>> +	 * master event being called by pmu->add() and pmu->del().
>> +	 * This event is allocated with perf_event_alloc(). When
>> +	 * attached to a ctx, this event should hold ctx->refcount.
>> +	 */
>> +	struct perf_event *master;
>> +	/* number of events in the ctx that share the master */
>> +	int total_event_count;
>> +	/* number of active events of the master */
>> +	int active_event_count;
>> +};
>> +
>> +#define MAX_PERF_EVENT_DUP_PER_CTX 4
>> /**
>> * struct perf_event_context - event context structure
>> *
>> @@ -791,6 +849,9 @@ struct perf_event_context {
>> #endif
>> void *task_ctx_data; /* pmu specific data */
>> struct rcu_head rcu_head;
>> +
>> + /* for PMU sharing. array is needed for O(1) access */
>> + struct perf_event_dup dup_events[MAX_PERF_EVENT_DUP_PER_CTX];
>
> Yuck!
>
> event_pmu_{add,del,read}() appear to be the consumer of this array
> thing, but I'm not seeing why we need it.
>
> That is, again, why can't we use one of the actual events as master and
> have a dup_master pointer per event and then do something like:
>
> event_pmu_add()
> {
> 	if (event->dup_master != event)
> 		return;
>
> 	event->pmu->add(event, PERF_EF_START);
> }
>
> Such that we only schedule the master events and ignore all duplicates.
>
> Then on read it can do something like:
>
> event_pmu_read()
> {
> 	if (event->dup_master == event)
> 		return;
>
> 	/* use event->dup_master as counter */
> again:
> 	prev_count = local64_read(&hwc->prev_count);
> 	count = local64_read(&event->dup_master->count);
> 	if (local64_cmpxchg(&hwc->prev_count, prev_count, count) != prev_count)
> 		goto again;
>
> 	delta = count - prev_count;
> 	local64_add(delta, &event->count);
> }
>
>> };
>
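
For reference, the read-side idea above can be modelled in user space with C11
atomics in place of the kernel's local64_t helpers. This is only a sketch under
that assumption; struct shared_ev and shared_ev_read() are made-up names, and
the master's own pmu->read() path is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of perf_event used on the read path. */
struct shared_ev {
	struct shared_ev *dup_master;	/* self if this event owns the counter */
	_Atomic uint64_t count;		/* value reported for this event */
	_Atomic uint64_t prev_count;	/* last master count folded into 'count' */
};

/* Fold the master's progress since the last read into a duplicate event. */
static void shared_ev_read(struct shared_ev *event)
{
	uint64_t prev, now;

	assert(event != NULL);
	/* Masters (and unshared events) would be read through the PMU instead. */
	if (!event->dup_master || event->dup_master == event)
		return;

	prev = atomic_load(&event->prev_count);
	do {
		now = atomic_load(&event->dup_master->count);
		/* On failure, 'prev' is refreshed with the current prev_count. */
	} while (!atomic_compare_exchange_weak(&event->prev_count, &prev, now));

	/* Only the reader that won the exchange accounts for this delta. */
	atomic_fetch_add(&event->count, now - prev);
}
```

The compare-and-exchange on prev_count is what keeps two racing readers from
adding the same delta twice.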
>> +/* Returns whether a perf_event can share PMU counter with other events */
>> +static inline bool perf_event_can_share(struct perf_event *event)
>> +{
>> +	/* only do sharing for hardware events */
>> +	if (is_software_event(event))
>> +		return false;
>> +
>> +	/*
>> +	 * limit sharing to counting events.
>> +	 * perf-stat sets PERF_SAMPLE_IDENTIFIER for counting events, so
>> +	 * let that in.
>> +	 */
>> +	if (event->attr.sample_type & ~PERF_SAMPLE_IDENTIFIER)
>> +		return false;
>
> Why is is_sampling_event() not usable?

Hmm... let me try it. Thanks for the pointer.
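
For comparison, a small model of the two checks. As far as I know,
is_sampling_event() only tests attr.sample_period != 0, so the two tests
disagree for an event that sets sample_type bits but has no sample period.
The struct and helper names below are hypothetical, and
MODEL_SAMPLE_IDENTIFIER merely mirrors the value of PERF_SAMPLE_IDENTIFIER.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SAMPLE_IP		(1ULL << 0)	/* mirrors PERF_SAMPLE_IP */
#define MODEL_SAMPLE_IDENTIFIER	(1ULL << 16)	/* mirrors PERF_SAMPLE_IDENTIFIER */

/* Hypothetical stand-in for the two perf_event_attr fields involved. */
struct model_attr {
	uint64_t sample_period;
	uint64_t sample_type;
};

/* Models is_sampling_event(): a non-zero period means sampling. */
static bool model_is_sampling(const struct model_attr *a)
{
	assert(a != NULL);
	return a->sample_period != 0;
}

/* The v6 test: any sample_type bit other than IDENTIFIER blocks sharing. */
static bool model_can_share_v6(const struct model_attr *a)
{
	assert(a != NULL);
	return !(a->sample_type & ~MODEL_SAMPLE_IDENTIFIER);
}

/* The suggested test: share whenever the event is not sampling. */
static bool model_can_share_new(const struct model_attr *a)
{
	return !model_is_sampling(a);
}
```

A perf-stat style event (period 0, IDENTIFIER set) passes both tests; the only
case where they differ is a zero period combined with other sample_type bits.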

>
>> +
>> +	return true;
>> +}
>> +
>> +/*
>> + * Returns whether the two events can share a PMU counter.
>> + *
>> + * Note: This function does NOT check perf_event_can_share() for
>> + * the two events, they should be checked before this function
>> + */
>> +static inline bool perf_event_compatible(struct perf_event *event_a,
>> +					 struct perf_event *event_b)
>> +{
>> +	return event_a->attr.type == event_b->attr.type &&
>> +		event_a->attr.config == event_b->attr.config &&
>> +		event_a->attr.config1 == event_b->attr.config1 &&
>> +		event_a->attr.config2 == event_b->attr.config2;
>> +}
>
> Slightly scared by this one.

I feel a little nervous too. Maybe we should memcmp the two attr?
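
To make the concern concrete, here is a toy comparison of the four-field check
against a whole-struct memcmp(). struct toy_attr is a hypothetical,
padding-free stand-in for perf_event_attr, with a single flags word standing
in for bits such as exclude_kernel. The real attr has many more fields, and
some of them probably should not block sharing, so memcmp() may well be too
strict.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, padding-free stand-in for struct perf_event_attr. */
struct toy_attr {
	uint64_t type;
	uint64_t config;
	uint64_t config1;
	uint64_t config2;
	uint64_t flags;		/* stands in for exclude_kernel and friends */
};

/* memcmp() compares raw bytes, so padding would make the result unreliable. */
_Static_assert(sizeof(struct toy_attr) == 5 * sizeof(uint64_t),
	       "toy_attr must not contain padding bytes");

#define TOY_EXCLUDE_KERNEL	(1ULL << 0)	/* made-up flag bit */

/* The v6 check: only type and the three config words are compared. */
static bool toy_compatible_v6(const struct toy_attr *a,
			      const struct toy_attr *b)
{
	assert(a && b);
	return a->type == b->type &&
	       a->config == b->config &&
	       a->config1 == b->config1 &&
	       a->config2 == b->config2;
}

/* The stricter alternative: every byte of the attr must match. */
static bool toy_compatible_memcmp(const struct toy_attr *a,
				  const struct toy_attr *b)
{
	assert(a && b);
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

Two events that differ only in the flags word are "compatible" under the
four-field check but not under memcmp(), which is the kind of case that makes
the current helper look risky.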

>
>
>> @@ -2612,20 +2828,9 @@ static int __perf_install_in_context(void *info)
>> raw_spin_lock(&task_ctx->lock);
>> }
>>
>> -#ifdef CONFIG_CGROUP_PERF
>> -	if (is_cgroup_event(event)) {
>> -		/*
>> -		 * If the current cgroup doesn't match the event's
>> -		 * cgroup, we should not try to schedule it.
>> -		 */
>> -		struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
>> -		reprogram = cgroup_is_descendant(cgrp->css.cgroup,
>> -					event->cgrp->css.cgroup);
>> -	}
>> -#endif
>
> Why is this removed?

Err... I bet I messed this up during a rebase. Sorry.

>
>> @@ -10986,6 +11198,14 @@ SYSCALL_DEFINE5(perf_event_open,
>> goto err_cred;
>> }
>>
>> +	if (perf_event_can_share(event)) {
>> +		event->tmp_master = perf_event_alloc(&event->attr, cpu,
>> +						     task, NULL, NULL,
>> +						     NULL, NULL, -1);
>> +		if (IS_ERR(event->tmp_master))
>> +			event->tmp_master = NULL;
>> +	}
>
>
>> @@ -11773,6 +12005,14 @@ inherit_event(struct perf_event *parent_event,
>> if (IS_ERR(child_event))
>> return child_event;
>>
>> +	if (perf_event_can_share(child_event)) {
>> +		child_event->tmp_master = perf_event_alloc(&parent_event->attr,
>> +							   parent_event->cpu,
>> +							   child, NULL, NULL,
>> +							   NULL, NULL, -1);
>> +		if (IS_ERR(child_event->tmp_master))
>> +			child_event->tmp_master = NULL;
>> +	}
>
> So this is terrible!

Let me try to get rid of the double alloc.

Thanks for the feedback!
Song