Re: [PATCH] perf test: Retry without grouping for all metrics test

From: Ian Rogers
Date: Wed Jun 14 2023 - 12:41:07 EST


On Wed, Jun 14, 2023 at 2:07 AM Sandipan Das <sandipan.das@xxxxxxx> wrote:
>
> There are cases where a metric uses more events than the number of
> counters. E.g. AMD Zen, Zen 2 and Zen 3 processors have four data fabric
> counters but the "nps1_die_to_dram" metric has eight events. By default,
> the constituent events are placed in a group. Since the events cannot be
> scheduled at the same time, the metric is not computed. The all metrics
> test also fails because of this.

Thanks Sandipan. So this is exposing a bug in the AMD data fabric PMU
driver. When the events are added the driver should create a fake PMU,
check that adding the group is valid and if not fail. The failure is
picked up by the tool and it will remove the group.

I appreciate the need for a time machine to make such a fix work. To
workaround the issue with the metrics add:
"MetricConstraint": "NO_GROUP_EVENTS",
to each metric in the json.

> Before announcing failure, the test can try multiple options for each
> available metric. After system-wide mode fails, retry once again with
> the "--metric-no-group" option.
>
> E.g.
>
> $ sudo perf test -v 100
>
> Before:
>
> 100: perf all metrics test :
> --- start ---
> test child forked, pid 672731
> Testing branch_misprediction_ratio
> Testing all_remote_links_outbound
> Testing nps1_die_to_dram
> Metric 'nps1_die_to_dram' not printed in:
> Error:
> Invalid event (dram_channel_data_controller_4) in per-thread mode, enable system wide with '-a'.

This error doesn't relate to grouping, so I'm confused about having it
in the commit message, aside from the test failure.

Thanks,
Ian

> Testing macro_ops_dispatched
> Testing all_l2_cache_accesses
> Testing all_l2_cache_hits
> Testing all_l2_cache_misses
> Testing ic_fetch_miss_ratio
> Testing l2_cache_accesses_from_l2_hwpf
> Testing l2_cache_misses_from_l2_hwpf
> Testing op_cache_fetch_miss_ratio
> Testing l3_read_miss_latency
> Testing l1_itlb_misses
> test child finished with -1
> ---- end ----
> perf all metrics test: FAILED!
>
> After:
>
> 100: perf all metrics test :
> --- start ---
> test child forked, pid 672887
> Testing branch_misprediction_ratio
> Testing all_remote_links_outbound
> Testing nps1_die_to_dram
> Testing macro_ops_dispatched
> Testing all_l2_cache_accesses
> Testing all_l2_cache_hits
> Testing all_l2_cache_misses
> Testing ic_fetch_miss_ratio
> Testing l2_cache_accesses_from_l2_hwpf
> Testing l2_cache_misses_from_l2_hwpf
> Testing op_cache_fetch_miss_ratio
> Testing l3_read_miss_latency
> Testing l1_itlb_misses
> test child finished with 0
> ---- end ----
> perf all metrics test: Ok
>
> Reported-by: Ayush Jain <ayush.jain3@xxxxxxx>
> Signed-off-by: Sandipan Das <sandipan.das@xxxxxxx>
> ---
> tools/perf/tests/shell/stat_all_metrics.sh | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/tools/perf/tests/shell/stat_all_metrics.sh b/tools/perf/tests/shell/stat_all_metrics.sh
> index 54774525e18a..1e88ea8c5677 100755
> --- a/tools/perf/tests/shell/stat_all_metrics.sh
> +++ b/tools/perf/tests/shell/stat_all_metrics.sh
> @@ -16,6 +16,13 @@ for m in $(perf list --raw-dump metrics); do
> then
> continue
> fi
> + # Failed again, possibly there are not enough counters so retry system wide
> + # mode but without event grouping.
> + result=$(perf stat -M "$m" --metric-no-group -a sleep 0.01 2>&1)
> + if [[ "$result" =~ ${m:0:50} ]]
> + then
> + continue
> + fi
> # Failed again, possibly the workload was too small so retry with something
> # longer.
> result=$(perf stat -M "$m" perf bench internals synthesize 2>&1)
> --
> 2.34.1
>