Re: [External] Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup

From: Chengming Zhou
Date: Wed Mar 23 2022 - 09:37:35 EST


On 2022/3/23 9:17 下午, Peter Zijlstra wrote:
> On Wed, Mar 23, 2022 at 09:07:01PM +0800, Chengming Zhou wrote:
>> On 2022/3/23 8:51 下午, Peter Zijlstra wrote:
>>> On Tue, Mar 22, 2022 at 08:08:30PM +0800, Chengming Zhou wrote:
>>>
>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>> index 8b5cf2aedfe6..848a3bfa9513 100644
>>>> --- a/kernel/events/core.c
>>>> +++ b/kernel/events/core.c
>>>
>>>> @@ -843,11 +845,21 @@ static void perf_cgroup_switch(struct task_struct *task)
>>>> */
>>>> local_irq_save(flags);
>>>>
>>>> + cgrp = perf_cgroup_from_task(task, NULL);
>>>> + if (cgrp == __this_cpu_read(cpu_perf_cgroup))
>>>> + goto out;
>
> So this compares the cpu thing against the task thing, if matching, we
> bail.
>
>>>> +
>>>> + __this_cpu_write(cpu_perf_cgroup, cgrp);
>
> Then we set cpu thing.
>
>>>> +
>>>> list = this_cpu_ptr(&cgrp_cpuctx_list);
>>>> list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
>>>> WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
>>>>
>>>> perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>>> +
>>>> + if (cpuctx->cgrp == cgrp)
>>>> + continue;
>>>> +
>>>> perf_pmu_disable(cpuctx->ctx.pmu);
>>>>
>>>> cpu_ctx_sched_out(cpuctx, EVENT_ALL);
>
>>>> + cpuctx->cgrp = cgrp
>
> But here we already have exactly the same pattern, we match cpuctx thing
> against task thing and skip/set etc.

Yes, cpu_perf_cgroup is just "cache" which cgrp assigned to cpuctx->cgrp.

>
>>> Also, I really don't see the point of cpu_perf_cgroup, cpuctx->cgrp
>>> should be able to do this.
>>
>> But the problem is that we have two cpuctx on the percpu list, do you
>> mean we should use perf_cgroup of the first cpuctx to compare with
>> the next task's perf_cgroup ?
>>
>> Or we should delete the cgrp in cpuctx, and use this new percpu cpu_perf_cgroup?
>
> I'm a bit confused, per the above, you already do exactly what the new
> cpu_perf_cgroup does on the cpuctx->cgrp variable. AFAICT the only think
> the new per-cpu variable does is avoid a lock, howveer:

we have cgrp in cpuctx make me confused at first too.

perf_cgroup_event_enable()
if (ctx->is_active && !cpuctx->cgrp)
if ...
cpuctx->cgrp = cgrp; --> set cpuctx->cgrp when enable event

list_add(&cpuctx->cgrp_cpuctx_entry,
per_cpu_ptr(&cgrp_cpuctx_list, event->cpu)) --> add cpuctx on percpu list

But we have two (hw and sw) cpuctx, so these two cpuctx->cgrp maybe different...

This is the reason why I want to create a percpu cpu_perf_cgroup,
just "cache" cgrp last scheduled, to compare with next task to decide
whether perf_cgroup_switch() can be skipped.

But you are right, having cpuctx->cgrp and cpu_perf_cgroup make things confused..
maybe we can delete cpuctx->cgrp, change to use percpu cpu_perf_cgroup?

>
>
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -833,6 +833,7 @@ static DEFINE_PER_CPU(struct list_head,
>>> */
>>> static void perf_cgroup_switch(struct task_struct *task)
>>> {
>>> + struct perf_cgroup *cgrp;
>>> struct perf_cpu_context *cpuctx, *tmp;
>>> struct list_head *list;
>>> unsigned long flags;
>>> @@ -843,11 +844,20 @@ static void perf_cgroup_switch(struct ta
>>> */
>>> local_irq_save(flags);
>>>
>>> + cgrp = perf_cgroup_from_task(task, NULL);
>>> +
>>> list = this_cpu_ptr(&cgrp_cpuctx_list);
>>> list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
>>> WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
>>>
>>> + if (READ_ONCE(cpuctx->cgrp == cgrp))
>>> + continue
>
> I think we can avoid that by doing an early check, hmm?

perf_cgroup_switch() can be called from context_switch(), or __perf_cgroup_move() from IPI.
Say if in context_switch() already set cpuctx->cgrp to the new task->cgroups, then context_switch()
finish, handle IPI in __perf_cgroup_move(), we don't need to do sched_out/in again.

Thanks.

>
>>> +
>>> perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>> +
>>> + if (cpuctx->cgrp == cgrp)
>>> + goto next;
>>> +
>>> perf_pmu_disable(cpuctx->ctx.pmu);
>>>
>>> cpu_ctx_sched_out(cpuctx, EVENT_ALL);
>>> @@ -855,50 +865,22 @@ static void perf_cgroup_switch(struct ta
>>> * must not be done before ctxswout due
>>> * to event_filter_match() in event_sched_out()
>>> */
>>> - cpuctx->cgrp = perf_cgroup_from_task(task,
>>> - &cpuctx->ctx);
>>> + WRITE_ONCE(cpuctx->cgrp, cgrp);