Re: [PATCH v1 00/40] Fix perf on Intel hybrid CPUs

From: Liang, Kan
Date: Wed Apr 26 2023 - 09:53:57 EST

Next message: Aradhya Bhatia: "Re: [PATCH v2 1/2] arm64: dts/ti: am65x: Add Rocktech OLDI panel DT overlay"
Previous message: Liang Yang: "Re: [PATCH v1 4/5] mtd: rawnand: meson: clear OOB buffer before read"
In reply to: Ian Rogers: "[PATCH v1 39/40] perf jevents: Don't rewrite metrics across PMUs"
Next in thread: Arnaldo Carvalho de Melo: "Re: [PATCH v1 00/40] Fix perf on Intel hybrid CPUs"

On 2023-04-26 3:00 a.m., Ian Rogers wrote:
> TL;DR: hybrid doesn't crash, json metrics work on hybrid on both PMUs
> or individually, event parsing doesn't always scan all PMUs, more and
> new tests that also run without hybrid, less code.
>
> The first patches were previously posted to improve metrics here:
> "perf stat: Introduce skippable evsels"
> https://lore.kernel.org/all/20230414051922.3625666-1-irogers@xxxxxxxxxx/
> "perf vendor events intel: Add xxx metric constraints"
> https://lore.kernel.org/all/20230419005423.343862-1-irogers@xxxxxxxxxx/
>
> Next are some general test improvements.
>
> Next event parsing is rewritten to not scan all PMUs for the benefit
> of raw and legacy cache parsing, instead these are handled by the
> lexer and a new term type. This ultimately removes the need for the
> event parser for hybrid to be recursive as legacy cache can be just a
> term. Tests are re-enabled for events with hyphens, so AMD's
> branch-brs event is now parsable.
>
> The cputype option is made a generic pmu filter flag and is tested
> even on non-hybrid systems.
>
> The final patches address specific json metric issues on hybrid, in
> both the json metrics and the metric code. They also bring in a new
> json option to not group events when matching a metricgroup, this
> helps reduce counter pressure for TopdownL1 and TopdownL2 metric
> groups. The updates to the script that updates the json are posted in:
> https://github.com/intel/perfmon/pull/73
>
> The patches add slightly more code than they remove, in areas like
> better json metric constraints and tests, but in the core util code,
> the removal of hybrid is a net reduction:
> 20 files changed, 631 insertions(+), 951 deletions(-)
>
> There's specific detail with each patch, but for now here is the 6.3
> output followed by that from perf-tools-next with the patch series
> applied. The tool is running on an Alderlake CPU on an elderly 5.15
> kernel:
>
> Events on hybrid that parse and pass tests:
> '''
> $ perf-6.3 version
> perf version 6.3.rc7.gb7bc77e2f2c7
> $ perf-6.3 test
> ...
> 6.1: Test event parsing : FAILED!
> ...
> $ perf test
> ...
> 6: Parse event definition strings :
> 6.1: Test event parsing : Ok
> 6.2: Parsing of all PMU events from sysfs : Ok
> 6.3: Parsing of given PMU events from sysfs : Ok
> 6.4: Parsing of aliased events from sysfs : Skip (no aliases in sysfs)
> 6.5: Parsing of aliased events : Ok
> 6.6: Parsing of terms (event modifiers) : Ok
> ...
> '''
>
> No event/metric running with json metrics and TopdownL1 on both PMUs:
> '''
> $ perf-6.3 stat -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 24,073.58 msec cpu-clock # 23.975 CPUs utilized
> 350 context-switches # 14.539 /sec
> 25 cpu-migrations # 1.038 /sec
> 66 page-faults # 2.742 /sec
> 21,257,199 cpu_core/cycles/ # 883.009 K/sec
> 2,162,192 cpu_atom/cycles/ # 89.816 K/sec
> 6,679,379 cpu_core/instructions/ # 277.457 K/sec
> 753,197 cpu_atom/instructions/ # 31.287 K/sec
> 1,300,647 cpu_core/branches/ # 54.028 K/sec
> 148,652 cpu_atom/branches/ # 6.175 K/sec
> 117,429 cpu_core/branch-misses/ # 4.878 K/sec
> 14,396 cpu_atom/branch-misses/ # 598.000 /sec
> 123,097,644 cpu_core/slots/ # 5.113 M/sec
> 9,241,207 cpu_core/topdown-retiring/ # 7.5% Retiring
> 8,903,288 cpu_core/topdown-bad-spec/ # 7.2% Bad Speculation
> 66,590,029 cpu_core/topdown-fe-bound/ # 54.1% Frontend Bound
> 38,397,500 cpu_core/topdown-be-bound/ # 31.2% Backend Bound
> 3,294,283 cpu_core/topdown-heavy-ops/ # 2.7% Heavy Operations # 4.8% Light Operations
> 8,855,769 cpu_core/topdown-br-mispredict/ # 7.2% Branch Mispredict # 0.0% Machine Clears
> 57,695,714 cpu_core/topdown-fetch-lat/ # 46.9% Fetch Latency # 7.2% Fetch Bandwidth
> 12,823,926 cpu_core/topdown-mem-bound/ # 10.4% Memory Bound # 20.8% Core Bound
>
> 1.004093622 seconds time elapsed
>
> $ perf stat -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 24,064.65 msec cpu-clock # 23.973 CPUs utilized
> 384 context-switches # 15.957 /sec
> 24 cpu-migrations # 0.997 /sec
> 71 page-faults # 2.950 /sec
> 19,737,646 cpu_core/cycles/ # 820.192 K/sec
> 122,018,505 cpu_atom/cycles/ # 5.070 M/sec (63.32%)
> 7,636,653 cpu_core/instructions/ # 317.339 K/sec
> 16,266,629 cpu_atom/instructions/ # 675.955 K/sec (72.50%)
> 1,552,995 cpu_core/branches/ # 64.534 K/sec
> 3,208,143 cpu_atom/branches/ # 133.314 K/sec (72.50%)
> 132,151 cpu_core/branch-misses/ # 5.491 K/sec
> 547,285 cpu_atom/branch-misses/ # 22.742 K/sec (72.49%)
> 32,110,597 cpu_atom/TOPDOWN_RETIRING.ALL/ # 1.334 M/sec
> # 18.4 % tma_bad_speculation (72.48%)
> 228,006,765 cpu_atom/TOPDOWN_FE_BOUND.ALL/ # 9.475 M/sec
> # 38.1 % tma_frontend_bound (72.47%)
> 225,866,251 cpu_atom/TOPDOWN_BE_BOUND.ALL/ # 9.386 M/sec
> # 37.7 % tma_backend_bound
> # 37.7 % tma_backend_bound_aux (72.73%)
> 119,748,254 cpu_atom/CPU_CLK_UNHALTED.CORE/ # 4.976 M/sec
> # 5.2 % tma_retiring (73.14%)
> 31,363,579 cpu_atom/TOPDOWN_RETIRING.ALL/ # 1.303 M/sec (73.37%)
> 227,907,321 cpu_atom/TOPDOWN_FE_BOUND.ALL/ # 9.471 M/sec (63.95%)
> 228,803,268 cpu_atom/TOPDOWN_BE_BOUND.ALL/ # 9.508 M/sec (63.55%)
> 113,357,334 cpu_core/TOPDOWN.SLOTS/ # 30.5 % tma_backend_bound
> # 9.2 % tma_retiring
> # 8.7 % tma_bad_speculation
> # 51.6 % tma_frontend_bound
> 10,451,044 cpu_core/topdown-retiring/
> 9,687,449 cpu_core/topdown-bad-spec/
> 58,703,214 cpu_core/topdown-fe-bound/
> 34,540,660 cpu_core/topdown-be-bound/
> 154,902 cpu_core/INT_MISC.UOP_DROPPING/ # 6.437 K/sec
>
> 1.003818397 seconds time elapsed
> '''

Thanks for the fixes. That should work for the -M or --topdown options.
But I don't think the above output is better than the 6.3 output for
the *default* perf stat.

- The multiplexing on the atom PMU messes up the other events: the
  cpu_atom cycles, instructions, branches and branch-misses counts are
  now all multiplexed.
- The "M/sec" rate seems useless for the Topdown events.
- The tma_* names are not generic.
  "Retiring" is much better than "tma_retiring" as a generic annotation.
  It should work for both x86 and Arm.
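
To illustrate why the generic names read well, here is a minimal Python
sketch that reproduces the level 1 percentages of the 6.3 output from
the raw cpu_core counts quoted above. It assumes each share is simply
the raw count divided by the slots count; that is my assumption for the
illustration, and the formula perf actually uses may differ (for
example, it may normalize by the sum of the four counts instead). The
function and variable names are hypothetical.

```python
# Raw cpu_core counts copied from the perf 6.3 output quoted above.
SLOTS = 123_097_644
LEVEL1_COUNTS = {
    "Retiring": 9_241_207,
    "Bad Speculation": 8_903_288,
    "Frontend Bound": 66_590_029,
    "Backend Bound": 38_397_500,
}


def level1_shares(counts, slots):
    """Return each category as a percentage of slots, rounded to 0.1.

    Assumption: share = count / slots. This is an illustrative formula,
    not necessarily the one perf implements.
    """
    return {name: round(100.0 * value / slots, 1)
            for name, value in counts.items()}


if __name__ == "__main__":
    for name, pct in level1_shares(LEVEL1_COUNTS, SLOTS).items():
        print(f"{pct:5.1f}% {name}")
```

Under that assumption the sketch gives 7.5%, 7.2%, 54.1% and 31.2%,
which matches the annotations in the 6.3 output.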

As the default, it's better to provide a clean and generic output for
the end users.

If users want more detail, they can use the -M or --topdown options.
The events and formats are expected to differ between architectures.

Also, there seems to be a bug affecting all atom Topdown events: they
are displayed twice.
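
A quick way to spot this kind of duplication is to count the event
names in the perf stat text output. The sketch below is a hypothetical
helper, not part of perf; the sample lines are copied from the
perf-tools-next output quoted above, and the regular expression is a
rough assumption about the line layout (a count, an optional "msec"
unit, then the event name).

```python
import re
from collections import Counter

# cpu_atom event lines copied from the perf-tools-next output quoted above.
SAMPLE = """\
 32,110,597 cpu_atom/TOPDOWN_RETIRING.ALL/ # 1.334 M/sec
 # 18.4 % tma_bad_speculation (72.48%)
 228,006,765 cpu_atom/TOPDOWN_FE_BOUND.ALL/ # 9.475 M/sec
 225,866,251 cpu_atom/TOPDOWN_BE_BOUND.ALL/ # 9.386 M/sec
 119,748,254 cpu_atom/CPU_CLK_UNHALTED.CORE/ # 4.976 M/sec
 31,363,579 cpu_atom/TOPDOWN_RETIRING.ALL/ # 1.303 M/sec (73.37%)
 227,907,321 cpu_atom/TOPDOWN_FE_BOUND.ALL/ # 9.471 M/sec (63.95%)
 228,803,268 cpu_atom/TOPDOWN_BE_BOUND.ALL/ # 9.508 M/sec (63.55%)
"""

# A count (digits, commas, dots), an optional "msec" unit, then the event.
# Metric-only lines start with '#' and therefore do not match.
EVENT_RE = re.compile(r"^\s*[\d,.]+\s+(?:msec\s+)?(\S+)")


def duplicated_events(text):
    """Return the sorted event names that appear on more than one line."""
    counts = Counter()
    for line in text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    for name in duplicated_events(SAMPLE):
        print(name)
```

On the sample it reports the three cpu_atom TOPDOWN_*.ALL events, while
CPU_CLK_UNHALTED.CORE appears only once.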

Thanks,
Kan
