Re: [PATCH v3 4/5] perf evlist: Respect all_cpus when setting user_requested_cpus

From: Adrian Hunter
Date: Fri Apr 29 2022 - 07:34:53 EST


On 28/04/22 23:49, Ian Rogers wrote:
> On Thu, Apr 28, 2022 at 1:16 PM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>
> On 8/04/22 06:56, Ian Rogers wrote:
> > If all_cpus is calculated it represents the merge/union of all
> > evsel cpu maps. By default user_requested_cpus is computed to be
> > the online CPUs. For uncore events, it is often the case currently
> > that all_cpus is a subset of user_requested_cpus. Metrics printed
> > without aggregation and with metric-only, in print_no_aggr_metric,
> > iterate over user_requested_cpus assuming every CPU has a metric to
> > print. For each CPU the prefix is printed, but if the evsel's
> > cpus don't contain that CPU nothing follows, giving empty lines
> > like the following on a 2 socket 36 core SkylakeX:
> >
> > ```
> > $ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000
> >      1.000453137 CPU0                       0.00
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137 CPU18                      0.00
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      1.000453137
> >      2.003717143 CPU0                       0.00
> > ...
> > ```
> >
> > While it is possible to be lazier in printing the prefix and
> > trailing newline, having user_requested_cpus be a subset of
> > all_cpus is preferable so that wasted work isn't done elsewhere
> > that user_requested_cpus is used. The change modifies
> > user_requested_cpus to be the intersection of the user-specified
> > CPUs (or, by default, all online CPUs) with the CPUs computed by
> > merging all evsel cpu maps.
> >
> > New behavior:
> > ```
> > $ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000
> >      1.001086325 CPU0                       0.00
> >      1.001086325 CPU18                      0.00
> >      2.003671291 CPU0                       0.00
> >      2.003671291 CPU18                      0.00
> > ...
> > ```
> >
> > Signed-off-by: Ian Rogers <irogers@xxxxxxxxxx>
> > ---
> >  tools/perf/util/evlist.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > index 52ea004ba01e..196d57b905a0 100644
> > --- a/tools/perf/util/evlist.c
> > +++ b/tools/perf/util/evlist.c
> > @@ -1036,6 +1036,13 @@ int evlist__create_maps(struct evlist *evlist, struct target *target)
> >       if (!cpus)
> >               goto out_delete_threads;
>
> > +     if (evlist->core.all_cpus) {
> > +             struct perf_cpu_map *tmp;
> > +
> > +             tmp = perf_cpu_map__intersect(cpus, evlist->core.all_cpus);
>
> Isn't an uncore PMU represented as being on CPU0 actually
> collecting data that can be due to any CPU?
>
>
> This is correct, but the counter is only opened on CPU0 as the all_cpus cpu_map will only contain CPU0. Trying to read the counter for, say, CPU1 will fail as there is no counter there. This is why the metric-only output above shows nothing for the other CPUs.

That's not what happens for me:

$ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000 -- sleep 1
# time CPU DRAM_BW_Use
1.001114691 CPU0 0.00
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.001114691
1.002265387 CPU0 0.00
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387
1.002265387

perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000 -C 1 -- sleep 1
# time CPU DRAM_BW_Use
1.001100827 CPU1 0.00
1.002128527 CPU1 0.00


>  
>
> Or for an uncore PMU represented as being on CPU0-CPU4 on a
> 4 core 8 hyperthread processor, actually 1 PMU per core?
>
>
> In this case I believe the CPU map will be CPU0, CPU2, CPU4, CPU6. To get the core counter for hyperthreads on CPU0 and CPU1 you read on CPU0; there is no counter on CPU1, and trying to read it will fail because the counters are indexed by a cpu map index into all_cpus. Not long ago I cleaned up the cpu_map code as there was quite a bit of confusion over cpus and indexes, which were both of type int.
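
To make the cpu vs cpu-map-index point concrete, here is a small standalone model. The cpu_map struct and cpu_map__idx() helper below are illustrative stand-ins for libperf's perf_cpu_map and perf_cpu_map__idx(), not the real tree code:

```
#include <stdio.h>

/* Illustrative stand-in for libperf's perf_cpu_map: a sorted CPU list. */
struct cpu_map {
	int nr;
	const int *cpus;
};

/* Stand-in for perf_cpu_map__idx(): map a CPU number to a map index, -1 if absent. */
static int cpu_map__idx(const struct cpu_map *map, int cpu)
{
	for (int i = 0; i < map->nr; i++) {
		if (map->cpus[i] == cpu)
			return i;
	}
	return -1;
}

int main(void)
{
	/* Uncore PMU on a 4 core / 8 hyperthread system: one counter per core. */
	static const int uncore_cpus[] = { 0, 2, 4, 6 };
	const struct cpu_map all_cpus = { 4, uncore_cpus };

	for (int cpu = 0; cpu < 8; cpu++) {
		int idx = cpu_map__idx(&all_cpus, cpu);

		if (idx < 0)
			printf("CPU%d: no counter, a read would fail\n", cpu);
		else
			printf("CPU%d: counter at cpu map index %d\n", cpu, idx);
	}
	return 0;
}
```
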
>  
>
> So I am not sure intersection makes sense.
>
> Also it is not obvious what happens with hybrid CPUs or
> per thread recording.
>
>
> The majority of code is using all_cpus, and so is unchanged by this change.

I am not sure what you mean. Every tool uses this code. It affects everything when using PMUs with their own cpus.

> Code that is affected, when it, say, needs to use counters, needs to check that the user CPU is valid in all_cpus, and use the all_cpus index. The metric-only output could be fixed in the same way, i.e. don't display lines when the user_requested_cpu isn't in all_cpus. I preferred to solve the problem this way as it is inefficient to be processing cpus for which there can be no corresponding counters, etc. We may be setting something like affinity unnecessarily - although that doesn't currently happen, as that code iterates over all_cpus. I also think it is confusing, given its name, for the all_cpus variable to hold a cpu_map that contains fewer cpus than user_requested_cpus - albeit that was worse when user_requested_cpus was called just cpus.
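
For comparison, a standalone sketch of that "lazier printing" alternative: skip the prefix and trailing newline entirely for CPUs that are not in all_cpus. The arrays and the has_cpu() helper are illustrative, not the real print_no_aggr_metric() code (libperf's perf_cpu_map__has() would play the membership-test role in the tree):

```
#include <stdbool.h>
#include <stdio.h>

/* Illustrative membership test; perf_cpu_map__has() plays this role in libperf. */
static bool has_cpu(const int *cpus, int nr, int cpu)
{
	for (int i = 0; i < nr; i++) {
		if (cpus[i] == cpu)
			return true;
	}
	return false;
}

int main(void)
{
	/* The uncore metric only has counters on CPU0 and CPU18 ... */
	const int all_cpus[] = { 0, 18 };

	/* ... while user_requested_cpus defaults to every online CPU. */
	for (int cpu = 0; cpu < 36; cpu++) {
		/* Lazier printing: no prefix or newline for CPUs with no counters. */
		if (!has_cpu(all_cpus, 2, cpu))
			continue;
		printf("CPU%-4d %10.2f\n", cpu, 0.0);
	}
	return 0;
}
```
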
>
> It could be that hybrid or intel-pt have different assumptions about these cpu_maps. I don't have access to a hybrid test system. For intel-pt it'd be great if there were a perf test. Given that most code is using all_cpus and was cleaned up as part of the cpu_map work, I believe the change to be correct.

Mainly, what happens if you try to intersect all_cpus with dummy cpus?

>
> Thanks,
> Ian
>
>
> > +             perf_cpu_map__put(cpus);
> > +             cpus = tmp;
> > +     }
> >       evlist->core.has_user_cpus = !!target->cpu_list && !target->hybrid;
>
> >       perf_evlist__set_maps(&evlist->core, cpus, threads);
>
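
For reference, a minimal sketch of what the intersection in the hunk above produces for the SkylakeX example. perf_cpu_map__intersect() is the helper added earlier in this series; its exact header location and signature are assumed here, and the CPU lists are made up for illustration:

```
#include <perf/cpumap.h>	/* libperf public API; intersect assumed exported here */
#include <stdio.h>

int main(void)
{
	/* Default user_requested_cpus: all online CPUs (0-71 on the example box). */
	struct perf_cpu_map *cpus = perf_cpu_map__new("0-71");
	/* Merge/union of the evsel cpu maps for the uncore metric. */
	struct perf_cpu_map *all_cpus = perf_cpu_map__new("0,18");
	struct perf_cpu_map *tmp;
	struct perf_cpu cpu;
	int idx;

	/* Mirrors the hunk: replace cpus with its intersection with all_cpus. */
	tmp = perf_cpu_map__intersect(cpus, all_cpus);
	perf_cpu_map__put(cpus);
	cpus = tmp;

	/* Only CPU0 and CPU18 remain, so only they get metric lines. */
	perf_cpu_map__for_each_cpu(cpu, idx, cpus)
		printf("CPU%d\n", cpu.cpu);

	perf_cpu_map__put(all_cpus);
	perf_cpu_map__put(cpus);
	return 0;
}
```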