Re: [PATCH] perf/core: Introduce cpuctx->cgrp_ctx_list

From: Namhyung Kim
Date: Wed Oct 04 2023 - 12:32:41 EST


Hi Peter,

On Wed, Oct 4, 2023 at 9:02 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Oct 03, 2023 at 09:08:44PM -0700, Namhyung Kim wrote:
>
> > But after the change, it ended up iterating all pmus/events in the cpu
> > context if there's a cgroup event somewhere on the cpu context.
> > Unfortunately it includes uncore pmus which have much longer latency to
> > control.
>
> Can you describe the problem in more detail please?

Sure.

>
> We have cgrp as part of the tree key: {cpu, pmu, cgroup, idx},
> so it should be possible to find a specific cgroup for a cpu and or skip
> to the next cgroup on that cpu in O(log n) time.

That helps within a single (core) pmu when it has a lot of
events. But this problem is different: it's about touching
more pmus than necessary.

Say we have the following events for CPU 0.

sw: context-switches
core: cycles, cycles-for-cgroup-A
uncore: whatever
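
For example, such a mix could come from something like the
commands below (the cgroup name 'A' and the uncore event name
are made up for illustration and depend on the platform):

  perf stat -C 0 -e context-switches,cycles sleep 10 &
  perf stat -C 0 -e cycles -G A sleep 10 &
  perf stat -C 0 -e uncore_imc/data_reads/ sleep 10 &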

The cpu context has a cgroup event, so it needs to call
perf_cgroup_switch() at every context switch. But it actually
only needs to reschedule the 'core' pmu, since that's the only
pmu with a cgroup event. Events on the other pmus (like
context-switches or any uncore event) should not be disturbed
by that.

But perf_cgroup_switch() calls the generic functions which
iterate over all pmus in the (cpu) context.

cpuctx.ctx.pmu_ctx_list:
+-> sw -> core -> uncore (pmu_ctx_entry)

Then it disables the pmus, scheds out the current cgroup
events, switches the cgroup pointer, scheds in the new events
and enables the pmus again. This adds a lot of overhead when
uncore pmus are involved, since accessing MSRs for uncore pmus
has a much longer latency. And uncore pmus cannot have cgroup
events in the first place.
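
To illustrate, the disable step looks roughly like this (a
simplified sketch based on my reading of the current code, and
perf_ctx_enable() walks the same list on the way back):

static void perf_ctx_disable(struct perf_event_context *ctx)
{
        struct perf_event_pmu_context *pmu_ctx;

        /* walks *every* pmu on this CPU, cgroup event or not */
        list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
                perf_pmu_disable(pmu_ctx->pmu);
}

So an uncore pmu gets disabled and re-enabled on every cgroup
switch even though none of its events can be affected by it.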

So we need a separate list to track only the pmus that have
active cgroup events.

cpuctx.cgrp_ctx_list:
+-> core (cgrp_ctx_entry)

And we also need logic to do the same work, but only for this
list.
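
For instance, a disable helper for the cgroup case would look
something like this (just a sketch, the actual helper name may
differ from what's in the patch):

static void perf_ctx_disable_cgroup(struct perf_cpu_context *cpuctx)
{
        struct perf_cpu_pmu_context *cpc;

        /* only pmus that actually have cgroup events are linked here */
        list_for_each_entry(cpc, &cpuctx->cgrp_ctx_list, cgrp_ctx_entry)
                perf_pmu_disable(cpc->epc.pmu);
}

and similarly for enable and the cgroup ctx sched in/out, so
uncore pmus are never touched on a cgroup switch.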

Hope this helps.

>
> > To fix the issue, I restored a linked list equivalent to cgrp_cpuctx_list
> > in the perf_cpu_context and link perf_cpu_pmu_contexts that have cgroup
> > events only. Also add new helpers to enable/disable and does ctx sched
> > in/out for cgroups.
>
> Adding a list and duplicating the whole scheduling infrastructure seems
> 'unfortunate' at best.

Yeah, I know.. but I couldn't come up with a better solution.

Thanks,
Namhyung