Re: [Regression or Fix] perf: profiling stats significantly changed for aio_write/read (ext4) between 6.7.0-rc1 and 6.6.0

From: Namhyung Kim
Date: Wed Nov 15 2023 - 10:45:54 EST


Hello,

On Wed, Nov 15, 2023 at 2:32 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
>
> Namhyung, could you please take a look, you know how to operate this
> cgroup stuff.

Sure.

>
> On Tue, Nov 14, 2023 at 11:20:32PM +0800, David Wang wrote:
> >
> >
> > At 2023-11-14 12:31:12, "David Wang" <00107082@xxxxxxx> wrote:
> > >Hi,
> > >
> > >I was profiling the kernel to identify changes between revisions under a fio benchmark load,
> >
> > >But I did not notice a significant change in the overhead of perf tracing based on the fio report,
> > >which is why I am not sure whether this is a regression or an improvement...
> > >I am just reporting the change.

Do you mean f06cc667f79 ("perf: Optimize perf_cgroup_switch")?
Did you have cgroup events when you ran the benchmark?
Otherwise it should not make any difference.
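
For context, that commit makes the cgroup-switch path skip PMU
contexts that contain no cgroup events instead of rescheduling
every PMU. The gist, as a simplified fragment (paraphrasing the
upstream code, not an exact copy):

    /*
     * On a cgroup switch, only PMU contexts that actually
     * contain cgroup events are disabled and rescheduled;
     * the rest keep running untouched.
     */
    list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
            if (cgroup && !pmu_ctx->nr_cgroups)
                    continue;       /* no cgroup events here */
            perf_pmu_disable(pmu_ctx->pmu);
    }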

> > >
> >
> > I ran the same test on another disk that is almost idle all the time, and still detected significant changes in the perf sample stats:
> >
> > 6.7.0-rc1
> >
> > aio_write(49.340% 131501/266521)
> >   ext4_file_write_iter(94.567% 124356/131501)
> >     iomap_dio_rw(78.494% 97612/124356)
> >       __iomap_dio_rw(99.550% 97173/97612)
> >         iomap_dio_bio_iter(47.773% 46422/97173) <<----
> >         blk_finish_plug(29.931% 29085/97173) <<----
> >         iomap_iter(14.082% 13684/97173)
> >         kmalloc_trace(1.814% 1763/97173)
> >         kiocb_invalidate_pages(0.631% 613/97173)
> >         invalidate_inode_pages2_range(0.448% 435/97173)
> >         srso_return_thunk(0.320% 311/97173)
> >         blk_start_plug(0.167% 162/97173)
> >
> >
> > 6.7.0-rc1 with f06cc667f79 reverted
> >
> > aio_write(49.071% 155873/317649)
> >   ext4_file_write_iter(95.211% 148409/155873)
> >     iomap_dio_rw(81.816% 121422/148409)
> >       __iomap_dio_rw(99.684% 121038/121422)
> >         iomap_dio_bio_iter(40.280% 48754/121038) <<---
> >         blk_finish_plug(36.760% 44494/121038) <<---
> >         iomap_iter(14.657% 17740/121038)
> >         kmalloc_trace(1.775% 2149/121038)
> >         kiocb_invalidate_pages(0.599% 725/121038)
> >         invalidate_inode_pages2_range(0.330% 399/121038)
> >         srso_return_thunk(0.263% 318/121038)
> >         blk_start_plug(0.169% 204/121038)

I cannot say whether it's good or bad just by looking
at these numbers. Maybe the fs/block folks would know...
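
One difference that is easy to quantify, though: the run with the
revert collected noticeably more samples over the same workload,
317649 vs 266521 at the top level (155873 vs 131501 under
aio_write), i.e. 317649 / 266521 ~= 1.19, roughly 19% more samples
with the revert.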

> >
> >
> > I removed the 4 occurrences of the nr_cgroups check:
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 683dc086ef10..7583418e94d9 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -690,8 +690,8 @@ static void perf_ctx_disable(struct perf_event_context *ctx, bool cgroup)
> >         struct perf_event_pmu_context *pmu_ctx;
> >
> >         list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> > -               if (cgroup && !pmu_ctx->nr_cgroups)
> > -                       continue;
> > +               // if (cgroup && !pmu_ctx->nr_cgroups)
> > +               //         continue;
> >                 perf_pmu_disable(pmu_ctx->pmu);
> >         }
> > }
> > @@ -701,8 +701,8 @@ static void perf_ctx_enable(struct perf_event_context *ctx, bool cgroup)
> >         struct perf_event_pmu_context *pmu_ctx;
> >
> >         list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> > -               if (cgroup && !pmu_ctx->nr_cgroups)
> > -                       continue;
> > +               // if (cgroup && !pmu_ctx->nr_cgroups)
> > +               //         continue;
> >                 perf_pmu_enable(pmu_ctx->pmu);
> >         }
> > }
> > @@ -3307,8 +3307,8 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
> >         is_active ^= ctx->is_active; /* changed bits */
> >
> >         list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> > -               if (cgroup && !pmu_ctx->nr_cgroups)
> > -                       continue;
> > +               // if (cgroup && !pmu_ctx->nr_cgroups)
> > +               //         continue;
> >                 __pmu_ctx_sched_out(pmu_ctx, is_active);
> >         }
> > }
> >
> >
> > The resulting profiling stats are similar to those with f06cc667f79 reverted.

Yeah, that's because you effectively reverted the change. :)

> > I think there may be cases where pmu_ctx->nr_cgroups is zero but there is still a cgroup event within, so some samples are missed, causing the stats change. Just a guess.
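
For reference, the counter those checks test is a per-pmu-context
count of cgroup events, which the commit keeps up to date as cgroup
events are enabled and disabled. A simplified sketch of that
bookkeeping (function names and placement approximate, not the
exact upstream code):

    /*
     * pmu_ctx->nr_cgroups goes up when a cgroup event in this PMU
     * context is enabled and down when it is disabled. Only if it
     * read 0 while a cgroup event was still live would the
     * "continue" above skip that event.
     */
    static void cgroup_event_enable(struct perf_event *event)
    {
            event->pmu_ctx->nr_cgroups++;
    }

    static void cgroup_event_disable(struct perf_event *event)
    {
            event->pmu_ctx->nr_cgroups--;
    }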

I don't know what makes you think so. Do you see
differences in the perf output? Do any cgroup events have
a smaller number of samples or counts?

Or was the fio benchmark seriously degraded?
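
If you want to check that directly, you can open a cgroup event
with perf_event_open() and PERF_FLAG_PID_CGROUP and compare its
counts on both kernels. A minimal sketch (it assumes a cgroup v2
mount at /sys/fs/cgroup and an existing cgroup "mygroup", both just
examples; error handling is trimmed):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            struct perf_event_attr attr;
            uint64_t count;
            int cgrp_fd, ev_fd;

            /* the cgroup directory fd acts as the "pid" argument */
            cgrp_fd = open("/sys/fs/cgroup/mygroup", O_RDONLY);

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;

            /* cgroup events must be per-CPU: pid = cgroup fd, cpu >= 0 */
            ev_fd = syscall(__NR_perf_event_open, &attr, cgrp_fd, 0, -1,
                            PERF_FLAG_PID_CGROUP);
            if (ev_fd < 0) {
                    perror("perf_event_open");
                    return 1;
            }

            sleep(1);       /* let the workload run on CPU 0 */

            if (read(ev_fd, &count, sizeof(count)) == sizeof(count))
                    printf("cycles in cgroup: %llu\n",
                           (unsigned long long)count);

            close(ev_fd);
            close(cgrp_fd);
            return 0;
    }

Or just run the fio job inside a cgroup and profile it with
perf stat/record -a -G <cgroup>, then compare the cgroup event's
sample and count numbers between the two kernels.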

Thanks,
Namhyung