Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

From: Ian Rogers
Date: Wed Nov 22 2023 - 10:35:05 EST


On Wed, Nov 22, 2023 at 5:06 AM Arnaldo Carvalho de Melo
<acme@xxxxxxxxxx> wrote:
>
> Em Wed, Nov 22, 2023 at 12:23:27PM +0900, Hector Martin escreveu:
> > On 2023/11/22 1:38, Ian Rogers wrote:
> > > On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > >> On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> > >>> On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > >>>> On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > >>>>> On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > >>>>>> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > >>>>>>> On Tue, 21 Nov 2023 13:40:31 +0000,
> > >>>>>>> Marc Zyngier <maz@xxxxxxxxxx> wrote:
> > >>>>>>>>
> > >>>>>>>> [Adding key people on Cc]
> > >>>>>>>>
> > >>>>>>>> On Tue, 21 Nov 2023 12:08:48 +0000,
> > >>>>>>>> Hector Martin <marcan@xxxxxxxxx> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Perf broke on all Apple ARM64 systems (tested almost everything), and
> > >>>>>>>>> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > >>>>>>>>
> > >>>>>>>> I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > >>>>>>>> asymmetric ARM platform. It isn't clear what criteria is used to pick
> > >>>>>>>> the PMU, but nothing works anymore.
> > >>>>>>>>
> > >>>>>>>> The saving grace in my case is that Debian still ships a 6.1 perftool
> > >>>>>>>> package, but that's obviously not going to last.
> > >>>>>>>>
> > >>>>>>>> I'm happy to test potential fixes.
> > >>>>>>>
> > >>>>>>> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > >>>>>>> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > >>>>>>> CPU):
> > >>>>>>
> > >>>>>> IIUC the tool is doing the wrong thing here and overriding explicit
> > >>>>>> ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > >>>>>> that ${pmu}'s type and event namespace.
> > >>>>>>
> > >>>>>> Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > >>>>>> targetted to a specific PMU, it's semantically wrong to rewrite events like
> > >>>>>> this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > >>>>>> PERF_COUNT_HW_${EVENT}.
> > >>>>>
> > >>>>> If you name a PMU and an event then the event should only be opened on
> > >>>>> that PMU, 100% agree. There's a bunch of output, but when the legacy
> > >>>>> cycles event is opened it appears to be because it was explicitly
> > >>>>> requested.
> > >>>>
> > >>>> I think you've missed that the named PMU events are being erreously transformed
> > >>>> into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> > >>>>
> > >>>> Opening: apple_firestorm_pmu/cycles/
> > >>>> ------------------------------------------------------------
> > >>>> perf_event_attr:
> > >>>> type 0 (PERF_TYPE_HARDWARE)
> > >>>> size 136
> > >>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > >>>> sample_type IDENTIFIER
> > >>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > >>>> disabled 1
> > >>>> inherit 1
> > >>>> enable_on_exec 1
> > >>>> exclude_guest 1
> > >>>> ------------------------------------------------------------
> > >>>> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > >>>>
> > >>>> ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> > >>>>
> > >>>> Marc said that he bisected the issue down to commit:
> > >>>>
> > >>>> 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> > >>>>
> > >>>> ... so it looks like something is going wrong when the events are being parsed,
> > >>>> e.g. losing the HW PMU information?
> > >>>
> > >>> Ok, I think I'm getting confused by other things. This looks like the issue.
> > >>>
> > >>> I think it may be working as intended, but not how you intended :-) If
> > >>> a core PMU is listed and then a legacy event, the legacy event should
>
> The point is that "cycles" when prefixed with "pmu/" shouldn't be
> considered "cycles" as HW/0, in that setting it is "cycles" for that
> PMU. (but we only have "cpu_cycles" for at least the a53 and a72 PMUs I
> have access in a Libre Computer rockchip 3399-pc hybrid board, if we use
> it, then we get what we want/had before, see below):
>
> And there is an attempt at using the specified PMU, see the first
> perf_event_open:
>
> root@roc-rk3399-pc:~# strace -e perf_event_open perf stat -vv -e cycles,armv8_cortex_a53/cycles/,armv8_cortex_a72/cycles/ echo
> Using CPUID 0x00000000410fd082
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> config 0x700000000
> disabled 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8perf_event_open({type=PERF_TYPE_HARDWARE, size=0 /* PERF_ATTR_SIZE_??? */, config=0x7<<32|PERF_COUNT_HW_CPU_CYCLES, sample_period=0, sample_type=0, read_format=0, disabled=1, precise_ip=0 /* arbitrary skid */, ...}, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOENT (No such file or directory)
>
> //// HERE: it tries config=0x7<<32|PERF_COUNT_HW_CPU_CYCLES taking into
> //account the PMU number 0x7
>
> root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a53/type
> 7
> root@roc-rk3399-pc:~#
>
> But then we don't have "cycles" in that PMU:
>
> root@roc-rk3399-pc:~# ls -la /sys/devices/armv8_cortex_a53/events/cycles
> ls: cannot access '/sys/devices/armv8_cortex_a53/events/cycles': No such file or directory
> root@roc-rk3399-pc:~#
>
> Maybe:
>
> root@roc-rk3399-pc:~# taskset -c 5,6 perf stat -v -e armv8_cortex_a53/cpu_cycles/,armv8_cortex_a72/cpu_cycles/ echo
> Using CPUID 0x00000000410fd034
> Control descriptor is not initialized
>
> armv8_cortex_a53/cpu_cycles/: 0 2079000 0
> armv8_cortex_a72/cpu_cycles/: 2488961 2079000 2079000
>
> Performance counter stats for 'echo':
>
> <not counted> armv8_cortex_a53/cpu_cycles/ (0.00%)
> 2488961 armv8_cortex_a72/cpu_cycles/
>
> 0.003449266 seconds time elapsed
>
> 0.003502000 seconds user
> 0.000000000 seconds sys
>
>
> root@roc-rk3399-pc:~# taskset -c 0,1,2,3,4 perf stat -v -e armv8_cortex_a53/cpu_cycles/,armv8_cortex_a72/cpu_cycles/ echo
> Using CPUID 0x00000000410fd034
> Control descriptor is not initialized
>
> armv8_cortex_a53/cpu_cycles/: 2986601 6999416 6999416
> armv8_cortex_a72/cpu_cycles/: 0 6999416 0
>
> Performance counter stats for 'echo':
>
> 2986601 armv8_cortex_a53/cpu_cycles/
> <not counted> armv8_cortex_a72/cpu_cycles/ (0.00%)
>
> 0.011434508 seconds time elapsed
>
> 0.003911000 seconds user
> 0.007454000 seconds sys
>
>
> root@roc-rk3399-pc:~#
>
> root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a53/events/cpu_cycles
> event=0x0011
> root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a72/events/cpu_cycles
> event=0x0011
> root@roc-rk3399-pc:~#
>
> And the syscalls seem sane:
>
> root@roc-rk3399-pc:~# strace -e perf_event_open taskset -c 0,1,2,3,4 perf stat -v -e armv8_cortex_a53/cpu_cycles/,armv8_cortex_a72/cpu_cycles/ echo
> Using CPUID 0x00000000410fd034
> Control descriptor is not initialized
> perf_event_open({type=0x7 /* PERF_TYPE_??? */, size=0x88 /* PERF_ATTR_SIZE_??? */, config=0x11, sample_period=0, sample_type=PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING, disabled=1, inherit=1, enable_on_exec=1, precise_ip=0 /* arbitrary skid */, exclude_guest=1, ...}, 14573, -1, -1, PERF_FLAG_FD_CLOEXEC) = 3
> perf_event_open({type=0x8 /* PERF_TYPE_??? */, size=0x88 /* PERF_ATTR_SIZE_??? */, config=0x11, sample_period=0, sample_type=PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING, disabled=1, inherit=1, enable_on_exec=1, precise_ip=0 /* arbitrary skid */, exclude_guest=1, ...}, 14573, -1, -1, PERF_FLAG_FD_CLOEXEC) = 4
>
> --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=14573, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
> armv8_cortex_a53/cpu_cycles/: 3227098 4480875 4480875
> armv8_cortex_a72/cpu_cycles/: 0 4480875 0
>
> Performance counter stats for 'echo':
>
> 3227098 armv8_cortex_a53/cpu_cycles/
> <not counted> armv8_cortex_a72/cpu_cycles/ (0.00%)
>
> 0.008381759 seconds time elapsed
>
> 0.004064000 seconds user
> 0.004121000 seconds sys
>
>
> --- SIGCHLD {si_signo=SIGCHLD, si_code=SI_USER, si_pid=14572, si_uid=0} ---
> +++ exited with 0 +++
> root@roc-rk3399-pc:~#
>
> As:
>
> root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a53/type
> 7
> root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a72/type
> 8
> root@roc-rk3399-pc:~#
>
> See the type=0x7 and type=0x8.
>
> So what we need here seems to be to translate the generic term "cycles"
> to "cpu_cycles" when a PMU is explicitely passed in the event name and
> it doesn't have "cycles" and then just retry.

The PMU driver does the legacy to raw encoding translation, this is an
assumption the tool has of core PMUs. You can see ARM's PMU driver
doing the mapping here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/arm_pmuv3.c#n40

Thanks,
Ian


> - Arnaldo