Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

From: Ian Rogers
Date: Tue Nov 21 2023 - 10:41:37 EST


On Tue, Nov 21, 2023 at 7:24 AM Marc Zyngier <maz@xxxxxxxxxx> wrote:
>
> On Tue, 21 Nov 2023 13:40:31 +0000,
> Marc Zyngier <maz@xxxxxxxxxx> wrote:
> >
> > [Adding key people on Cc]
> >
> > On Tue, 21 Nov 2023 12:08:48 +0000,
> > Hector Martin <marcan@xxxxxxxxx> wrote:
> > >
> > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> >
> > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > the PMU, but nothing works anymore.
> >
> > The saving grace in my case is that Debian still ships a 6.1 perftool
> > package, but that's obviously not going to last.
> >
> > I'm happy to test potential fixes.
>
> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> CPU):
>
> <quote>
> maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> apple_firestorm_pmu/cycles/ -e cycles ls
> Using CPUID 0x00000000612f0280
> Attempt to add: apple_icestorm_pmu/cycles=0/
> ..after resolving event: apple_icestorm_pmu/cycles=0/
> Opening: unknown-hardware:HG
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> config 0xb00000000
> disabled 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> sys_perf_event_open failed, error -95
> Attempt to add: apple_firestorm_pmu/cycles=0/
> ..after resolving event: apple_firestorm_pmu/cycles=0/
> Control descriptor is not initialized
> Opening: apple_icestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
> Opening: apple_firestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> Opening: cycles
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 5
> arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> builtin-daemon.o builtin-list.c builtin-version.c perf ui
> builtin-data.c builtin-list.o builtin-version.o perf-archive util
> builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> builtin-diff.c builtin-mem.c command-list.txt perf.c
> apple_icestorm_pmu/cycles/: -1: 0 873709 0
> apple_firestorm_pmu/cycles/: -1: 0 873709 0
> cycles: -1: 0 873709 0
> apple_icestorm_pmu/cycles/: 0 873709 0
> apple_firestorm_pmu/cycles/: 0 873709 0
> cycles: 0 873709 0
>
> Performance counter stats for 'ls':
>
> <not counted> apple_icestorm_pmu/cycles/ (0.00%)
> <not counted> apple_firestorm_pmu/cycles/ (0.00%)
> <not counted> cycles (0.00%)
>
> 0.000002250 seconds time elapsed
>
> 0.000000000 seconds user
> 0.000000000 seconds sys
> </quote>
>
> If I run the same thing on another CPU cluster (firestorm), I get
> this:
>
> <quote>
> maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> apple_firestorm_pmu/cycles/ -e cycles ls
> Using CPUID 0x00000000612f0280
> Attempt to add: apple_icestorm_pmu/cycles=0/
> ..after resolving event: apple_icestorm_pmu/cycles=0/
> Opening: unknown-hardware:HG
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> config 0xb00000000
> disabled 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> sys_perf_event_open failed, error -95
> Attempt to add: apple_firestorm_pmu/cycles=0/
> ..after resolving event: apple_firestorm_pmu/cycles=0/
> Control descriptor is not initialized
> Opening: apple_icestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 3
> Opening: apple_firestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 4
> Opening: cycles
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 5
> arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> builtin-daemon.o builtin-list.c builtin-version.c perf ui
> builtin-data.c builtin-list.o builtin-version.o perf-archive util
> builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> builtin-diff.c builtin-mem.c command-list.txt perf.c
> apple_icestorm_pmu/cycles/: -1: 1035101 469125 469125
> apple_firestorm_pmu/cycles/: -1: 1035035 469125 469125
> cycles: -1: 1034653 469125 469125
> apple_icestorm_pmu/cycles/: 1035101 469125 469125
> apple_firestorm_pmu/cycles/: 1035035 469125 469125
> cycles: 1034653 469125 469125
>
> Performance counter stats for 'ls':
>
> 1,035,101 apple_icestorm_pmu/cycles/
> 1,035,035 apple_firestorm_pmu/cycles/
> 1,034,653 cycles
>
> 0.000001333 seconds time elapsed
>
> 0.000000000 seconds user
> 0.000000000 seconds sys
> </quote>
>
> which doesn't make any sense either. I really don't understand what
> this PERF_TYPE_HARDWARE does here (the *real* types are 10 and 11),
> nor what this 'cycle=0' stuff is.

Hi Marc,

I'm unclear if you are running a newer perf tool on an older kernel or
not. In any case I'll assume the kernel and perf tool versions match.
In Linux 6.6 this patch was added to the ARM PMU:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/perf/arm_pmu.c?id=5c816728651ae425954542fed64d21d40cb75a9f

My guess is that the apple_icestorm_pmu requires a similar patch. The
perf tool is supposed to not use extended types when they aren't
supported:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmus.c?h=perf-tools-next#n532
So I share your confusion as to why something broke.

PERF_TYPE_HARDWARE is a legacy type where there are hardcoded type and
config values that correspond to an event. The PMU driver turns legacy
events into the real types. On BIG.little systems if the legacy events
are monitoring a task a different event is needed for each PMU (ie >1
event). In your example you are monitoring 'ls', a task, and so
different cycles events are necessary. In the high 32-bits (the
extended type) the PMU is identified.

Thanks for reporting the issue,
Ian

> /puzzled
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.