Re: [PATCH] perf top: Use evsel's cpus to replace user_requested_cpus

From: Ian Rogers
Date: Tue Dec 12 2023 - 14:27:12 EST


On Tue, Dec 12, 2023 at 10:49 AM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> On Tue, Dec 12, 2023 at 10:31 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> >
> > On Tue, Dec 12, 2023 at 10:00:16AM -0800, Ian Rogers wrote:
> > > On Tue, Dec 12, 2023 at 9:23 AM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Dec 12, 2023 at 7:56 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 2023-12-11 4:13 p.m., Arnaldo Carvalho de Melo wrote:
> > > > > > On Fri, Dec 08, 2023 at 01:08:55PM -0800, kan.liang@xxxxxxxxxxxxxxx wrote:
> > > > > >> From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
> > > > > >>
> > > > > >> perf top errors out on a hybrid machine
> > > > > >> $perf top
> > > > > >>
> > > > > >> Error:
> > > > > >> The cycles:P event is not supported.
> > > > > >>
> > > > > >> The user_requested_cpus may contain CPUs that are invalid for a hybrid
> > > > > >> PMU. It causes perf_event_open to fail.
> > > > > >
> > > > > > ?
> > > > > >
> > > > > > All perf top expects is that the "cycles", the most basic one, be
> > > > > > collected, on all CPUs in the system.
> > > > > >
> > > > >
> > > > > Yes, but for hybrid there is no single "cycles" event which can cover
> > > > > all CPUs.
> > > >
> > > > Does that mean the kernel would reject the legacy "cycles" event
> > > > on hybrid CPUs?
> > >
> > > I believe not. When the extended type isn't set on legacy cycles, we
> > > often have the CPU and from that can determine the PMU. The issue is
> > > with the -1 "any CPU" perf_event_open option. As I was told, the PMU
> > > the event is opened on in this case is the first one registered in the
> > > kernel; on Intel hybrid this could be cpu_core or cpu_atom, but IIRC
> > > it'll probably be cpu_core. On ARM ¯\_(ツ)_/¯.
> >
> > On ARM it'll be essentially the same as on x86: if you open an event with
> > type==PERF_TYPE_HARDWARE (without the extended HW type pointing to a
> > specific PMU), and with cpu==-1, it'll go to an arbitrary CPU PMU, whichever
> > happens to be found by perf_init_event() when iterating over the 'pmus' list.
> >
> > If you open an event with type==PERF_TYPE_HARDWARE and cpu!=-1, the event
> > will be opened on the appropriate CPU PMU, by virtue of being rejected by others
> > when perf_init_event() iterates over the 'pmus' list.
>
> Ok, that means "cycles" with cpu == -1 would not work well.
>
> I'm curious if it's possible to do some basic work at event_init()
> time, like reserving (common) resources, and then do the rest at
> sched time to configure the PMU on the current CPU. That way users
> could simply use "cycles" or "instructions" for their processes.

It should be possible to make things better, especially for standard
events like cycles or instructions, for software that isn't aware of
multiple core PMUs and the extended type field. But there are legacy
events like dTLB-speculative-read that may be supported by one PMU and
not another, so what should happen with thread scheduling there?

On RISC-V there was discussion of moving the legacy event to
pmu/config mapping, which currently happens in the driver, into the
tool. A lot of the legacy events end up as "<not supported>", which
is a pretty bad user experience, like this on AMD:
```
# perf stat -d -a sleep 1

 Performance counter stats for 'system wide':

        259,256.21 msec cpu-clock                 #  257.445 CPUs utilized
            11,931      context-switches          #   46.020 /sec
               264      cpu-migrations            #    1.018 /sec
               419      page-faults               #    1.616 /sec
     5,892,812,250      cycles                    #    0.023 GHz                        (62.43%)
     1,159,909,806      stalled-cycles-frontend   #   19.68% frontend cycles idle       (62.48%)
       831,668,644      stalled-cycles-backend    #   14.11% backend cycles idle        (62.52%)
     2,825,351,898      instructions              #    0.48  insn per cycle
                                                  #    0.41  stalled cycles per insn    (62.54%)
       520,403,374      branches                  #    2.007 M/sec                      (62.57%)
        12,050,970      branch-misses             #    2.32% of all branches            (62.55%)
     1,117,382,209      L1-dcache-loads           #    4.310 M/sec                      (62.48%)
        61,060,457      L1-dcache-load-misses     #    5.46% of all L1-dcache accesses  (62.43%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       1.007033432 seconds time elapsed
```
So I think the move should be away from legacy events, but that
doesn't necessarily mean we can't make them better.

Thanks,
Ian

> Thanks,
> Namhyung