Re: [Regression or Fix] perf: profiling stats significantly changed for aio_write/read (ext4) between 6.7.0-rc1 and 6.6.0

From: David Wang
Date: Tue Nov 14 2023 - 10:21:23 EST




At 2023-11-14 12:31:12, "David Wang" <00107082@xxxxxxx> wrote:
>Hi,
>
>I was profiling the kernel to identify changes between revisions under an fio benchmark load,
>but I did not notice a significant change in the overhead of perf tracing based on the fio reports;
>that is why I am not sure whether this is a regression or an improvement...
>I am just reporting the change.
>

I ran the same test on another disk that is almost idle all the time, and the perf sample stats still changed significantly:
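
(To read the stats below: each line is function(share samples/parent-samples), i.e. the number of perf samples attributed to that function out of its parent's samples; e.g. aio_write's 131501/266521 ≈ 49.340%.)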

6.7.0-rc1

aio_write(49.340% 131501/266521)
ext4_file_write_iter(94.567% 124356/131501)
iomap_dio_rw(78.494% 97612/124356)
__iomap_dio_rw(99.550% 97173/97612)
iomap_dio_bio_iter(47.773% 46422/97173) <<----
blk_finish_plug(29.931% 29085/97173) <<----
iomap_iter(14.082% 13684/97173)
kmalloc_trace(1.814% 1763/97173)
kiocb_invalidate_pages(0.631% 613/97173)
invalidate_inode_pages2_range(0.448% 435/97173)
srso_return_thunk(0.320% 311/97173)
blk_start_plug(0.167% 162/97173)


6.7.0-rc1 with f06cc667f79 reverted

aio_write(49.071% 155873/317649)
ext4_file_write_iter(95.211% 148409/155873)
iomap_dio_rw(81.816% 121422/148409)
__iomap_dio_rw(99.684% 121038/121422)
iomap_dio_bio_iter(40.280% 48754/121038) <<----
blk_finish_plug(36.760% 44494/121038) <<----
iomap_iter(14.657% 17740/121038)
kmalloc_trace(1.775% 2149/121038)
kiocb_invalidate_pages(0.599% 725/121038)
invalidate_inode_pages2_range(0.330% 399/121038)
srso_return_thunk(0.263% 318/121038)
blk_start_plug(0.169% 204/121038)
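
The marked rows are where the two kernels diverge: on 6.7.0-rc1, iomap_dio_bio_iter takes 47.773% of __iomap_dio_rw's samples versus 40.280% with the commit reverted, while blk_finish_plug falls from 36.760% to 29.931%; roughly a 7-point shift in opposite directions. (The two runs collected different total sample counts, so the shares are more telling than the raw counts.)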


To isolate the skip logic from the rest of f06cc667f79, I commented out the 4 occurrences of the nr_cgroups check:

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 683dc086ef10..7583418e94d9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -690,8 +690,8 @@ static void perf_ctx_disable(struct perf_event_context *ctx, bool cgroup)
struct perf_event_pmu_context *pmu_ctx;

list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
- if (cgroup && !pmu_ctx->nr_cgroups)
- continue;
+ // if (cgroup && !pmu_ctx->nr_cgroups)
+ // continue;
perf_pmu_disable(pmu_ctx->pmu);
}
}
@@ -701,8 +701,8 @@ static void perf_ctx_enable(struct perf_event_context *ctx, bool cgroup)
struct perf_event_pmu_context *pmu_ctx;

list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
- if (cgroup && !pmu_ctx->nr_cgroups)
- continue;
+ // if (cgroup && !pmu_ctx->nr_cgroups)
+ // continue;
perf_pmu_enable(pmu_ctx->pmu);
}
}
@@ -3307,8 +3307,8 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
is_active ^= ctx->is_active; /* changed bits */

list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
- if (cgroup && !pmu_ctx->nr_cgroups)
- continue;
+ // if (cgroup && !pmu_ctx->nr_cgroups)
+ // continue;
__pmu_ctx_sched_out(pmu_ctx, is_active);
}
}


The resulting profiling stats are similar to those with f06cc667f79 reverted.
My guess is that in some cases pmu_ctx->nr_cgroups is zero while a cgroup event is still attached to the pmu_ctx, so the event is skipped and its samples are missed, which would explain the change in the stats. Just a guess.
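
To make the guess concrete, every path touched by the diff above follows the same pattern; here is a sketch paraphrasing the ctx_sched_out() hunk, with the suspected hole noted in comments (not the literal kernel code):

	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
		/*
		 * On a cgroup switch (cgroup == true), any pmu_ctx whose
		 * nr_cgroups count reads zero is skipped entirely.
		 */
		if (cgroup && !pmu_ctx->nr_cgroups)
			continue;
		/*
		 * If the count can read zero while a cgroup event is
		 * still linked into this pmu_ctx, the event is never
		 * scheduled out/in here and its samples are lost.
		 */
		__pmu_ctx_sched_out(pmu_ctx, is_active);
	}

If that can happen, commenting out the checks (as above) and reverting f06cc667f79 should both bring the samples back, which matches the stats I measured.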


David Wang