Re: [PATCH] perf print-events: make is_event_supported() more robust

From: Mark Rutland
Date: Wed Jan 24 2024 - 10:49:03 EST


On Sat, Jan 20, 2024 at 10:27:33AM -0800, Ian Rogers wrote:
> On Tue, Jan 16, 2024 at 9:04 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> >
> > Currently the perf tool doesn't deteect support for extneded event types
>
> nit: s/deteect/detect/
> nit: s/extneded/extended/

> > Thus is_event_supported() will fail to detect support for any events
> > targetting an Apple M1/M2 PMU, even where events would be supported with
>
> nit: s/targetting/targeting/

> > This patch updates is_event_supported() to additionally try opening
> > events with perf_event_attr::exclude_guest set, allowing support for
> > events to be detected on Apple M1/M2 systems. I beleive that this is
>
> nit: s/beleive/believe/

Whoops; I've fixed those in my local tree now.

[...]

> > Hector, Marc, I'd appreciate if either of you could give this a spin on
> > your M1/M2 machines. I've given it local testing with the arm_pmuv3
> > driver modified to behave the same as the apple_m1_pmu driver (requiring
> > exclude_guest, having a 'cycles' event in sysfs), but that might not
> > perfectly replicate your setup.
> >
> > The patch is based on the 'perf-tools-for-v6.8-1-2024-01-09' tag in the
> > perf-tools tree:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools.git/
> >
> > ... and I've pushed it out to the 'perf-tools/event-supported-filters'
> > branch in my tree:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/
> >
> > This patch *should* make it possible to do:
> >
> > perf stat -e cycles ./workload
> > perf stat -e instructions ./workload
> >
> > ... with those 'cycles' and 'instructions' events being automatically
> > expanded and reported as separate events per-PMU, which is a nice
> > quality-of-life improvement.
> >
> > Comparing before and after this patch:
> >
> > | # ./perf-before stat -e cycles true
> > |
> > | Performance counter stats for 'true':
> > |
> > | <not counted> cycles (0.00%)
> > |
> > | 0.000990250 seconds time elapsed
> > |
> > | 0.000934000 seconds user
> > | 0.000000000 seconds sys
> > |
> > | # ./perf-after stat -e cycles true
> > |
> > | Performance counter stats for 'true':
> > |
> > | 965175 armv8_pmuv3_0/cycles/
> > | <not counted> armv8_pmuv3_1/cycles/ (0.00%)
> > | <not counted> armv8_pmuv3_2/cycles/ (0.00%)
> > | <not counted> armv8_pmuv3_3/cycles/ (0.00%)
> > |
> > | 0.000836555 seconds time elapsed
> > |
> > | 0.000884000 seconds user
> > | 0.000000000 seconds sys
>
> Just to check, this is the expected expansion of cycles? I'm unclear
> why 4 core PMUs.

Yep; I had a fake big.LITTLE setup with four distinct microarchitectures (one
per CPU), so the expansion above is expected. I had meant to explain that in
the notes along with the other driver modifications, but I forgot, sorry!

> > This *shouldn't* change the interpetation of named-pmu events, e.g.
> >
> > perf stat -e apple_whichever_pmu/cycles/ ./workload
> >
> > ... should behave the same as without this patch
> >
> > Comparing before and after this patch:
> >
> > | # ./perf-before stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_1/cycles/ -e armv8_pmuv3_2/cycles/ -e armv8_pmuv3_3/cycles/ true
> > |
> > | Performance counter stats for 'true':
> > |
> > | <not counted> armv8_pmuv3_0/cycles/ (0.00%)
> > | <not counted> armv8_pmuv3_1/cycles/ (0.00%)
> > | <not counted> armv8_pmuv3_2/cycles/ (0.00%)
> > | 901415 armv8_pmuv3_3/cycles/
> > |
> > | 0.000756590 seconds time elapsed
> > |
> > | 0.000811000 seconds user
> > | 0.000000000 seconds sys
> > |
> > | # ./perf-after stat -e armv8_pmuv3_0/cycles/ -e armv8_pmuv3_1/cycles/ -e armv8_pmuv3_2/cycles/ -e armv8_pmuv3_3/cycles/ true
> > |
> > | Performance counter stats for 'true':
> > |
> > | 923314 armv8_pmuv3_0/cycles/
> > | <not counted> armv8_pmuv3_1/cycles/ (0.00%)
> > | <not counted> armv8_pmuv3_2/cycles/ (0.00%)
> > | <not counted> armv8_pmuv3_3/cycles/ (0.00%)
> > |
> > | 0.000782420 seconds time elapsed
> > |
> > | 0.000836000 seconds user
> > | 0.000000000 seconds sys
> >
> > One thing I'm still looing into is that this doesn't seem to do anything
> > for a default perf stat session, e.g.
> >
> > perf stat ./workload
> >
> > ... doesn't automatically expand the implicitly-created events into per-pmu
> > events.
>
> Ugh, weak symbols. x86 has overridden the default adding of attributes
> to do it for hybrid:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/arch/x86/util/evlist.c?h=perf-tools-next#n36
> I think we should lose the adding events via attributes and just go to
> using parse events for everything. I'll see if I can do some cleanup
> and that should resolve this - I also want to cleanup the default
> events/metrics and the detailed ones as we can drop the unsupported
> events, etc.

Ok; so IIUC we can treat that as a separate problem? I'm happy to test/review
patches there.

> > Comparing before and after this patch:
> >
> > | # ./perf-before stat true
> > |
> > | Performance counter stats for 'true':
> > |
> > | 0.42 msec task-clock # 0.569 CPUs utilized
> > | 0 context-switches # 0.000 /sec
> > | 0 cpu-migrations # 0.000 /sec
> > | 38 page-faults # 89.796 K/sec
> > | <not counted> cycles (0.00%)
> > | <not counted> instructions (0.00%)
> > | <not counted> branches (0.00%)
> > | <not counted> branch-misses (0.00%)
> > |
> > | 0.000744185 seconds time elapsed
> > |
> > | 0.000795000 seconds user
> > | 0.000000000 seconds sys
> > |
> > | # ./perf-after stat true
> > |
> > | Performance counter stats for 'true':
> > |
> > | 0.43 msec task-clock # 0.582 CPUs utilized
> > | 0 context-switches # 0.000 /sec
> > | 0 cpu-migrations # 0.000 /sec
> > | 38 page-faults # 88.960 K/sec
> > | <not counted> cycles (0.00%)
> > | <not counted> instructions (0.00%)
> > | <not counted> branches (0.00%)
> > | <not counted> branch-misses (0.00%)
> > |
> > | 0.000734120 seconds time elapsed
> > |
> > | 0.000786000 seconds user
> > | 0.000000000 seconds sys
> >
> > Ian, how does that behave on x86? Is that the same, or do the default
> > events get expanded?
>
> The default events are expanded, the not counted is a feature of a
> fast binary (true here). I'm trying to remove custom code paths so
> that things like this don't happen, sorry that you've come across
> another instance but at least I can fix it.

Huh; I'm surprised that works with the named-pmu events, since that's the same
'true' binary.

Is there anything I should go look at for that?

Mark.