Re: [PATCH] perf top: Use evsel's cpus to replace user_requested_cpus

From: Mark Rutland
Date: Fri Jan 05 2024 - 07:31:56 EST


On Fri, Dec 15, 2023 at 02:49:14PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Fri, Dec 15, 2023 at 04:51:49PM +0000, Mark Rutland escreveu:
> > On Fri, Dec 15, 2023 at 12:36:10PM -0300, Arnaldo Carvalho de Melo wrote:
> > > Em Tue, Dec 12, 2023 at 06:31:05PM +0000, Mark Rutland escreveu:
> > > > On ARM it'll be essentially the same as on x86: if you open an event with
> > > > type==PERF_EVENT_TYPE_HARDWARE (without the extended HW type pointing to a
> > > > specific PMU), and with cpu==-1, it'll go to an arbitrary CPU PMU, whichever
> > > > happens to be found by perf_init_event() when iterating over the 'pmus' list.
>
> > > > If you open an event with type==PERF_EVENT_TYPE_HARDWARE and cpu!=-1, the event
> > > > will opened on the appropriate CPU PMU, by virtue of being rejected by others
> > > > when perf_init_event() iterates over the 'pmus' list.
>
> > > The way that it is working non on my intel hybrid system, with Kan's
> > > patch, is equivalent to using this on the RK3399pc board I have:
>
> > > root@roc-rk3399-pc:~# perf top -e armv8_cortex_a72/cycles/P,armv8_cortex_a53/cycles/P
>
> > > Wouldn't be better to make 'perf top' on ARM work the way is being done
> > > in x86 now, i.e. default to opening the two events, one per PMU and
> > > allow the user to switch back and forth using the TUI/stdio?
>
> > TBH, for perf top I don't know *which* behaviour is preferable, but I agree
> > that it'd be good for x86 and arm to work in the same way.
>
> Right, reducing the difference in the user experience accross arches.
>
> > For design-cleanliness and consistency with other commands I can see that
> > opening those separately is probably for the best, but for typical usage of
> > perf top it's really nice to have those presented together without having to
> > tab back-and-forth between the distinct PMUs, and so the existing behaviour of
>
> Humm, so you would want two histogram viewers, one for each PMU, side by
> side?

I had meant as an aggregated view (the same as what you'd get if you opened one
plain PERF_TYPE_HARDWARE event per cpu); I hadn't considered side-by-side views.

To be clear, I'm personally happy to tab between per-pmu views, and if that's
how x86 works today for heterogeneous PMUs, I'm fine with the same for
arm/arm64. I was trying to say that I didn't have a strong preference.

> > using CPU-bound PERF_EVENT_TYPE_HARDWARE events is arguably nicer for the user.
>
> So, on ARM64, start the following 'perf trace' session, then run the
> stock 'perf top':
>
> root@roc-rk3399-pc:~# perf trace -e perf_event_open
> <SNIP calls doing capability queries and setting up sideband stuff>
> 535.764 ( 0.015 ms): perf/15627 perf_event_open(attr_uptr: { type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD, read_format: ID|LOST, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, sample_id_all: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 }, pid: -1, group_fd: -1, flags: FD_CLOEXEC) = 19
> 535.783 ( 0.067 ms): perf/15627 perf_event_open(attr_uptr: { type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD, read_format: ID|LOST, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, sample_id_all: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 }, pid: -1, cpu: 1, group_fd: -1, flags: FD_CLOEXEC) = 28
> 535.854 ( 0.063 ms): perf/15627 perf_event_open(attr_uptr: { type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD, read_format: ID|LOST, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, sample_id_all: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 }, pid: -1, cpu: 2, group_fd: -1, flags: FD_CLOEXEC) = 29
> 535.920 ( 0.015 ms): perf/15627 perf_event_open(attr_uptr: { type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD, read_format: ID|LOST, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, sample_id_all: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 }, pid: -1, cpu: 3, group_fd: -1, flags: FD_CLOEXEC) = 30
> 535.939 ( 0.016 ms): perf/15627 perf_event_open(attr_uptr: { type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD, read_format: ID|LOST, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, sample_id_all: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 }, pid: -1, cpu: 4, group_fd: -1, flags: FD_CLOEXEC) = 31
> 535.959 ( 0.011 ms): perf/15627 perf_event_open(attr_uptr: { type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD, read_format: ID|LOST, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, sample_id_all: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 }, pid: -1, cpu: 5, group_fd: -1, flags: FD_CLOEXEC) = 32
>
> root@roc-rk3399-pc:~# grep "CPU part" /proc/cpuinfo | uniq -c
> 4 CPU part : 0xd03
> 2 CPU part : 0xd08
> root@roc-rk3399-pc:~#
>
> It is already doing what you suggest, right? PERF_TYPE_HARDWARE, one
> counter per CPU, maps to armv8_cortex_a72/cycles/P and
> armv8_cortex_a53/cycles/P.

Sounds like it; as above I'm happy for that to change to per-PMU views.

> One thing I'm thinking is that we should split this per PMU at the
> hist_entry, so that we could show how many samples/events came from each
> of them...

That sounds sensible to me.

Mark.

>
> - Arnaldo
>
> > I don't have a strong feeling either way; I'm personally happy so long as
> > explicit pmu_name/event/ events don't get silently converted into
> > PERF_EVENT_TYPE_HARDWARE events, and as long as we correctly set the extended
> > HW type when we decide to use that.
> >
> > Thanks,
> > Mark.
> >
> > > Kan, I also noticed that the name of the event is:
> > >
> > > 1K cpu_atom/cycles:P/ ◆
> > > 11K cpu_core/cycles:P/
> > >
> > > If I try to use that on the command line:
> > >
> > > root@number:~# perf top -e cpu_atom/cycles:P/
> > > event syntax error: 'cpu_atom/cycles:P/'
> > > \___ Bad event or PMU
> > >
> > > Unable to find PMU or event on a PMU of 'cpu_atom'
> > >
> > > Initial error:
> > > event syntax error: 'cpu_atom/cycles:P/'
> > > \___ unknown term 'cycles:P' for pmu 'cpu_atom'
> > >
> > > valid terms: event,pc,edge,offcore_rsp,ldlat,inv,umask,cmask,config,config1,config2,config3,name,period,freq,branch_type,time,call-graph,stack-size,no-inherit,inherit,max-stack,nr,no-overwrite,overwrite,driver-config,percore,aux-output,aux-sample-size,metric-id,raw,legacy-cache,hardware
> > > Run 'perf list' for a list of valid events
> > >
> > > Usage: perf top [<options>]
> > >
> > > -e, --event <event> event selector. use 'perf list' to list available events
> > > root@number:~#
> > >
> > > It should be:
> > >
> > > "cpu_atom/cycles/P"