Re: [PATCH v1 00/40] Fix perf on Intel hybrid CPUs

From: Arnaldo Carvalho de Melo
Date: Wed Apr 26 2023 - 17:09:49 EST


On Wed, Apr 26, 2023 at 12:00:10AM -0700, Ian Rogers wrote:
> TL;DR: hybrid doesn't crash, json metrics work on hybrid on both PMUs
> or individually, event parsing doesn't always scan all PMUs, more and
> new tests that also run without hybrid, less code.
>
> The first patches were previously posted to improve metrics here:
> "perf stat: Introduce skippable evsels"
> https://lore.kernel.org/all/20230414051922.3625666-1-irogers@xxxxxxxxxx/
> "perf vendor events intel: Add xxx metric constraints"
> https://lore.kernel.org/all/20230419005423.343862-1-irogers@xxxxxxxxxx/
>
> Next are some general test improvements.

Kan,

Have you looked at this? I'm doing a test build on it now.

- Arnaldo

> Next event parsing is rewritten to not scan all PMUs for the benefit
> of raw and legacy cache parsing, instead these are handled by the
> lexer and a new term type. This ultimately removes the need for the
> event parser for hybrid to be recursive as legacy cache can be just a
> term. Tests are re-enabled for events with hyphens, so AMD's
> branch-brs event is now parsable.
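
To illustrate the idea of treating a raw code or a legacy cache name as just another term inside `pmu/.../`, here is a minimal Python sketch. It is not the perf lexer or grammar: the regular expressions, the function names `classify_term` and `parse_event`, and the tuple layout are all assumptions made for this example, and the cache pattern covers only a simplified subset of names.

```python
import re

# Hypothetical, simplified patterns; the real lexer rules are more involved.
RAW_RE = re.compile(r"^r[0-9a-fA-F]+$")
CACHE_RE = re.compile(
    r"^(L1-dcache|L1-icache|LLC|dTLB|iTLB|branch|node)"
    r"-(load|store|prefetch)(s|-misses)$",
    re.IGNORECASE,
)


def classify_term(term):
    """Classify one term as (kind, text, value) using only its spelling."""
    if "=" in term:
        key, value = term.split("=", 1)
        return ("config", key, value)
    if RAW_RE.match(term):
        # Raw event: 'r' followed by a hexadecimal config value.
        return ("raw", term, int(term[1:], 16))
    if CACHE_RE.match(term):
        return ("legacy_cache", term, None)
    # Anything else is treated as a plain event or modifier name.
    return ("name", term, None)


def parse_event(spec):
    """Split 'pmu/term1,term2/' into the PMU name and classified terms."""
    m = re.fullmatch(r"([A-Za-z0-9_]+)/(.*)/", spec)
    if m is None:
        raise ValueError(f"not a pmu/terms/ event: {spec!r}")
    pmu, body = m.groups()
    terms = [classify_term(t) for t in body.split(",") if t]
    return pmu, terms


if __name__ == "__main__":
    for spec in ("cpu_core/L1-dcache-load-misses/",
                 "cpu_atom/r1a/",
                 "cpu_core/UOPS_RETIRED.MS,cmask=1,edge/"):
        print(spec, "->", parse_event(spec))
```

Because each term is classified on its own, nothing in this sketch needs to walk a list of PMUs or recurse per PMU. A hyphenated name such as `branch-brs` simply falls through to the plain-name case.
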
>
> The cputype option is made a generic pmu filter flag and is tested
> even on non-hybrid systems.
>
> The final patches address specific json metric issues on hybrid, in
> both the json metrics and the metric code. They also bring in a new
> json option to not group events when matching a metricgroup, this
> helps reduce counter pressure for TopdownL1 and TopdownL2 metric
> groups. The updates to the script that updates the json are posted in:
> https://github.com/intel/perfmon/pull/73
>
> The patches add slightly more code than they remove, in areas like
> better json metric constraints and tests, but in the core util code,
> the removal of hybrid is a net reduction:
> 20 files changed, 631 insertions(+), 951 deletions(-)
>
> There's specific detail with each patch, but for now here is the 6.3
> output followed by that from perf-tools-next with the patch series
> applied. The tool is running on an Alderlake CPU on an elderly 5.15
> kernel:
>
> Events on hybrid that parse and pass tests:
> '''
> $ perf-6.3 version
> perf version 6.3.rc7.gb7bc77e2f2c7
> $ perf-6.3 test
> ...
> 6.1: Test event parsing : FAILED!
> ...
> $ perf test
> ...
> 6: Parse event definition strings :
> 6.1: Test event parsing : Ok
> 6.2: Parsing of all PMU events from sysfs : Ok
> 6.3: Parsing of given PMU events from sysfs : Ok
> 6.4: Parsing of aliased events from sysfs : Skip (no aliases in sysfs)
> 6.5: Parsing of aliased events : Ok
> 6.6: Parsing of terms (event modifiers) : Ok
> ...
> '''
>
> No event/metric running with json metrics and TopdownL1 on both PMUs:
> '''
> $ perf-6.3 stat -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 24,073.58 msec cpu-clock # 23.975 CPUs utilized
> 350 context-switches # 14.539 /sec
> 25 cpu-migrations # 1.038 /sec
> 66 page-faults # 2.742 /sec
> 21,257,199 cpu_core/cycles/ # 883.009 K/sec
> 2,162,192 cpu_atom/cycles/ # 89.816 K/sec
> 6,679,379 cpu_core/instructions/ # 277.457 K/sec
> 753,197 cpu_atom/instructions/ # 31.287 K/sec
> 1,300,647 cpu_core/branches/ # 54.028 K/sec
> 148,652 cpu_atom/branches/ # 6.175 K/sec
> 117,429 cpu_core/branch-misses/ # 4.878 K/sec
> 14,396 cpu_atom/branch-misses/ # 598.000 /sec
> 123,097,644 cpu_core/slots/ # 5.113 M/sec
> 9,241,207 cpu_core/topdown-retiring/ # 7.5% Retiring
> 8,903,288 cpu_core/topdown-bad-spec/ # 7.2% Bad Speculation
> 66,590,029 cpu_core/topdown-fe-bound/ # 54.1% Frontend Bound
> 38,397,500 cpu_core/topdown-be-bound/ # 31.2% Backend Bound
> 3,294,283 cpu_core/topdown-heavy-ops/ # 2.7% Heavy Operations # 4.8% Light Operations
> 8,855,769 cpu_core/topdown-br-mispredict/ # 7.2% Branch Mispredict # 0.0% Machine Clears
> 57,695,714 cpu_core/topdown-fetch-lat/ # 46.9% Fetch Latency # 7.2% Fetch Bandwidth
> 12,823,926 cpu_core/topdown-mem-bound/ # 10.4% Memory Bound # 20.8% Core Bound
>
> 1.004093622 seconds time elapsed
>
> $ perf stat -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 24,064.65 msec cpu-clock # 23.973 CPUs utilized
> 384 context-switches # 15.957 /sec
> 24 cpu-migrations # 0.997 /sec
> 71 page-faults # 2.950 /sec
> 19,737,646 cpu_core/cycles/ # 820.192 K/sec
> 122,018,505 cpu_atom/cycles/ # 5.070 M/sec (63.32%)
> 7,636,653 cpu_core/instructions/ # 317.339 K/sec
> 16,266,629 cpu_atom/instructions/ # 675.955 K/sec (72.50%)
> 1,552,995 cpu_core/branches/ # 64.534 K/sec
> 3,208,143 cpu_atom/branches/ # 133.314 K/sec (72.50%)
> 132,151 cpu_core/branch-misses/ # 5.491 K/sec
> 547,285 cpu_atom/branch-misses/ # 22.742 K/sec (72.49%)
> 32,110,597 cpu_atom/TOPDOWN_RETIRING.ALL/ # 1.334 M/sec
> # 18.4 % tma_bad_speculation (72.48%)
> 228,006,765 cpu_atom/TOPDOWN_FE_BOUND.ALL/ # 9.475 M/sec
> # 38.1 % tma_frontend_bound (72.47%)
> 225,866,251 cpu_atom/TOPDOWN_BE_BOUND.ALL/ # 9.386 M/sec
> # 37.7 % tma_backend_bound
> # 37.7 % tma_backend_bound_aux (72.73%)
> 119,748,254 cpu_atom/CPU_CLK_UNHALTED.CORE/ # 4.976 M/sec
> # 5.2 % tma_retiring (73.14%)
> 31,363,579 cpu_atom/TOPDOWN_RETIRING.ALL/ # 1.303 M/sec (73.37%)
> 227,907,321 cpu_atom/TOPDOWN_FE_BOUND.ALL/ # 9.471 M/sec (63.95%)
> 228,803,268 cpu_atom/TOPDOWN_BE_BOUND.ALL/ # 9.508 M/sec (63.55%)
> 113,357,334 cpu_core/TOPDOWN.SLOTS/ # 30.5 % tma_backend_bound
> # 9.2 % tma_retiring
> # 8.7 % tma_bad_speculation
> # 51.6 % tma_frontend_bound
> 10,451,044 cpu_core/topdown-retiring/
> 9,687,449 cpu_core/topdown-bad-spec/
> 58,703,214 cpu_core/topdown-fe-bound/
> 34,540,660 cpu_core/topdown-be-bound/
> 154,902 cpu_core/INT_MISC.UOP_DROPPING/ # 6.437 K/sec
>
> 1.003818397 seconds time elapsed
> '''
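
For readers checking the numbers, the annotation columns in the perf-6.3 output above can be reproduced with simple arithmetic. The sketch below assumes the usual reading of those columns: "CPUs utilized" is cpu-clock time over wall time, "/sec" rates are counts per second of cpu-clock, and each topdown percentage is the event count divided by the slots count. The function names are made up for this example; the input values are copied from the quoted output.

```python
# Values copied from the perf-6.3 'stat -a sleep 1' output quoted above.
cpu_clock_msec = 24_073.58
elapsed_sec = 1.004093622
context_switches = 350
slots = 123_097_644
topdown = {
    "Retiring": 9_241_207,
    "Bad Speculation": 8_903_288,
    "Frontend Bound": 66_590_029,
    "Backend Bound": 38_397_500,
}


def cpus_utilized(clock_msec, wall_sec):
    """cpu-clock time divided by wall-clock time."""
    return clock_msec / 1000.0 / wall_sec


def per_second(count, clock_msec):
    """Event count per second of cpu-clock time."""
    return count / (clock_msec / 1000.0)


def topdown_percent(count, total_slots):
    """Share of pipeline slots attributed to one topdown category."""
    return 100.0 * count / total_slots


if __name__ == "__main__":
    print(f"{cpus_utilized(cpu_clock_msec, elapsed_sec):.3f} CPUs utilized")
    print(f"{per_second(context_switches, cpu_clock_msec):.3f} /sec")
    for name, count in topdown.items():
        print(f"{topdown_percent(count, slots):.1f}% {name}")
```

Rounded to the precision perf prints, these agree with the quoted columns (23.975 CPUs utilized, 14.539 /sec, and 7.5%, 7.2%, 54.1%, 31.2% for the four level 1 categories).
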
>
> Json metrics that don't crash:
> '''
> $ perf-6.3 stat -M TopdownL1 -a sleep 1
> WARNING: events in group from different hybrid PMUs!
> WARNING: grouped events cpus do not match, disabling group:
> anon group { topdown-retiring, topdown-retiring, INT_MISC.UOP_DROPPING, topdown-fe-bound, topdown-fe-bound, CPU_CLK_UNHALTED.CORE, topdown-be-bound, topdown-be-bound, topdown-bad-spec, topdown-bad-spec }
> Error:
> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (topdown-retiring).
> /bin/dmesg | grep -i perf may provide additional information.
>
> $ perf stat -M TopdownL1 -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 811,810 cpu_atom/TOPDOWN_RETIRING.ALL/ # 26.6 % tma_bad_speculation
> 3,239,281 cpu_atom/TOPDOWN_FE_BOUND.ALL/ # 38.8 % tma_frontend_bound
> 2,037,667 cpu_atom/TOPDOWN_BE_BOUND.ALL/ # 24.4 % tma_backend_bound
> # 24.4 % tma_backend_bound_aux
> 1,670,438 cpu_atom/CPU_CLK_UNHALTED.CORE/ # 9.7 % tma_retiring
> 808,138 cpu_atom/TOPDOWN_RETIRING.ALL/
> 3,234,707 cpu_atom/TOPDOWN_FE_BOUND.ALL/
> 2,081,420 cpu_atom/TOPDOWN_BE_BOUND.ALL/
> 122,795,280 cpu_core/TOPDOWN.SLOTS/ # 31.7 % tma_backend_bound
> # 7.0 % tma_bad_speculation
> # 54.1 % tma_frontend_bound
> # 7.2 % tma_retiring
> 8,817,636 cpu_core/topdown-retiring/
> 8,480,817 cpu_core/topdown-bad-spec/
> 3,108,926 cpu_core/topdown-heavy-ops/
> 66,566,215 cpu_core/topdown-fe-bound/
> 38,958,811 cpu_core/topdown-be-bound/
> 134,194 cpu_core/INT_MISC.UOP_DROPPING/
>
> 1.003607796 seconds time elapsed
>
> $ perf stat -M TopdownL2 -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 162,334,218 cpu_atom/TOPDOWN_FE_BOUND.FRONTEND_LATENCY/ # 27.7 % tma_fetch_latency (38.99%)
> 16,191,486 cpu_atom/INST_RETIRED.ANY/ (45.76%)
> 68,443,205 cpu_atom/TOPDOWN_BE_BOUND.MEM_SCHEDULER/ # 32.2 % tma_memory_bound
> # 5.8 % tma_core_bound (45.77%)
> 14,920,109 cpu_atom/UOPS_RETIRED.MS/ # 2.9 % tma_base (45.92%)
> 14,829,879 cpu_atom/UOPS_RETIRED.MS/ # 2.5 % tma_ms_uops (46.31%)
> 31,860,520 cpu_atom/TOPDOWN_RETIRING.ALL/ (46.71%)
> 117,323,055 cpu_atom/CPU_CLK_UNHALTED.CORE/ # 18.7 % tma_branch_mispredicts
> # 11.5 % tma_fetch_bandwidth
> # 0.3 % tma_machine_clears
> # 37.9 % tma_resource_bound (53.49%)
> 222,579,768 cpu_atom/TOPDOWN_BE_BOUND.ALL/ (53.90%)
> 13,672,174 cpu_atom/MEM_SCHEDULER_BLOCK.ST_BUF/ (54.23%)
> 24,264,262 cpu_atom/LD_HEAD.ANY_AT_RET/ (47.46%)
> 13,872,813 cpu_atom/MEM_SCHEDULER_BLOCK.ALL/ (47.45%)
> 223,722,007 cpu_atom/TOPDOWN_BE_BOUND.ALL/ (47.31%)
> 2,005,972 cpu_atom/TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS/ (46.91%)
> 109,423,013 cpu_atom/TOPDOWN_BAD_SPECULATION.MISPREDICT/ (39.72%)
> 67,420,790 cpu_atom/TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH/ (39.33%)
> 92,790,312 cpu_core/TOPDOWN.SLOTS/ # 24.3 % tma_core_bound
> # 3.0 % tma_heavy_operations
> # 5.6 % tma_light_operations
> # 10.8 % tma_memory_bound
> # 7.8 % tma_branch_mispredicts
> # 40.4 % tma_fetch_latency
> # 0.2 % tma_machine_clears
> # 7.8 % tma_fetch_bandwidth
> 8,041,595 cpu_core/topdown-retiring/
> 10,060,500 cpu_core/topdown-mem-bound/
> 7,314,344 cpu_core/topdown-bad-spec/
> 2,824,600 cpu_core/topdown-heavy-ops/
> 37,630,164 cpu_core/topdown-fetch-lat/
> 7,278,843 cpu_core/topdown-br-mispredict/
> 44,863,148 cpu_core/topdown-fe-bound/
> 32,573,458 cpu_core/topdown-be-bound/
> 5,785,074 cpu_core/INST_RETIRED.ANY/
> 2,325,424 cpu_core/UOPS_RETIRED.MS/
> 15,972,774 cpu_core/CPU_CLK_UNHALTED.THREAD/
> 117,750 cpu_core/INT_MISC.UOP_DROPPING/
>
> 1.003519749 seconds time elapsed
> '''
>
> Note, flags are added below to reduce the size of the output by
> removing event groups and threshold printing support:
> '''
> $ perf stat --metric-no-threshold --metric-no-group -M TopdownL3 -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 3,506,641 cpu_atom/TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS/ # 0.6 % tma_alloc_restriction (17.14%)
> 133,962,390 cpu_atom/TOPDOWN_BE_BOUND.SERIALIZATION/ # 22.2 % tma_serialization (17.48%)
> 11,201,207 cpu_atom/TOPDOWN_FE_BOUND.ITLB/ # 1.9 % tma_itlb_misses (17.88%)
> 63,876,838 cpu_atom/TOPDOWN_BE_BOUND.MEM_SCHEDULER/ # 10.6 % tma_mem_scheduler
> # 10.5 % tma_store_bound
> # 2.4 % tma_other_load_store (18.28%)
> 14,386,940 cpu_atom/UOPS_RETIRED.MS/ (18.68%)
> 14,432,493 cpu_atom/UOPS_RETIRED.MS/ # 2.7 % tma_other_ret (19.09%)
> 81,582,687 cpu_atom/TOPDOWN_FE_BOUND.ICACHE/ # 13.5 % tma_icache_misses (19.14%)
> 30,467,546 cpu_atom/TOPDOWN_RETIRING.ALL/ (19.14%)
> 16,788,753 cpu_atom/MEM_BOUND_STALLS.LOAD/ # 4.2 % tma_dram_bound
> # 3.7 % tma_l2_bound
> # 6.7 % tma_l3_bound (19.14%)
> 14,514,040 cpu_atom/TOPDOWN_FE_BOUND.DECODE/ # 2.4 % tma_decode (19.14%)
> 688,307 cpu_atom/TOPDOWN_BAD_SPECULATION.NUKE/ # 0.1 % tma_nuke (19.13%)
> 0 cpu_atom/UOPS_RETIRED.FPDIV/ (19.12%)
> 4,408,466 cpu_atom/MEM_BOUND_STALLS.LOAD_L2_HIT/ (19.12%)
> 120,556,998 cpu_atom/CPU_CLK_UNHALTED.CORE/ # 9.3 % tma_branch_detect
> # 1.0 % tma_branch_resteer
> # 5.8 % tma_cisc
> # 0.3 % tma_fast_nuke
> # 0.0 % tma_fpdiv_uops
> # 4.3 % tma_l1_bound
> # 3.2 % tma_non_mem_scheduler
> # 1.9 % tma_other_fb
> # 1.1 % tma_predecode
> # 0.1 % tma_register
> # 0.1 % tma_reorder_buffer (22.30%)
> 34,773,106 cpu_atom/TOPDOWN_FE_BOUND.CISC/ (22.30%)
> 591,112 cpu_atom/TOPDOWN_BE_BOUND.REGISTER/ (22.30%)
> 11,286,706 cpu_atom/TOPDOWN_FE_BOUND.OTHER/ (22.30%)
> 5,082,636 cpu_atom/MEM_BOUND_STALLS.LOAD_DRAM_HIT/ (22.30%)
> 14,146,185 cpu_atom/MEM_SCHEDULER_BLOCK.ST_BUF/ (22.31%)
> 55,833,686 cpu_atom/TOPDOWN_FE_BOUND.BRANCH_DETECT/ (22.30%)
> 25,714,051 cpu_atom/LD_HEAD.ANY_AT_RET/ (19.12%)
> 456,549 cpu_atom/TOPDOWN_BE_BOUND.REORDER_BUFFER/ (19.12%)
> 1,616,862 cpu_atom/TOPDOWN_BAD_SPECULATION.FASTNUKE/ (19.12%)
> 6,680,782 cpu_atom/TOPDOWN_FE_BOUND.PREDECODE/ (19.12%)
> 14,229,195 cpu_atom/MEM_SCHEDULER_BLOCK.ALL/ (19.12%)
> 8,128,921 cpu_atom/MEM_BOUND_STALLS.LOAD_LLC_HIT/ (19.12%)
> 20,941,725 cpu_atom/LD_HEAD.L1_MISS_AT_RET/ (19.11%)
> 6,177,125 cpu_atom/TOPDOWN_FE_BOUND.BRANCH_RESTEER/ (18.78%)
> 228,066,346 cpu_atom/TOPDOWN_BE_BOUND.ALL/ (18.38%)
> 5,204,897 cpu_atom/LD_HEAD.L1_BOUND_AT_RET/ (17.99%)
> 19,060,104 cpu_atom/TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER/ (17.58%)
> 0 cpu_atom/UOPS_RETIRED.FPDIV/ (17.19%)
> 864,565,692 cpu_core/TOPDOWN.SLOTS/ # 4.7 % tma_microcode_sequencer
> # 0.4 % tma_few_uops_instructions
> # 0.3 % tma_fused_instructions
> # 1.8 % tma_memory_operations
> # 0.1 % tma_nop_instructions
> # 8.9 % tma_ms_switches
> # 0.4 % tma_non_fused_branches
> # 0.0 % tma_fp_arith
> # 0.0 % tma_int_operations
> # 35.7 % tma_ports_utilization
> # 3.8 % tma_other_light_ops (18.03%)
> 100,519,954 cpu_core/topdown-retiring/ (18.03%)
> 68,964,454 cpu_core/topdown-bad-spec/ (18.03%)
> 44,732,021 cpu_core/topdown-heavy-ops/ (18.03%)
> 435,618,316 cpu_core/topdown-fe-bound/ (18.03%)
> 262,842,804 cpu_core/topdown-be-bound/ (18.03%)
> 10,368,608 cpu_core/BR_INST_RETIRED.ALL_BRANCHES/ (18.43%)
> 55,947,727 cpu_core/RESOURCE_STALLS.SCOREBOARD/ (18.84%)
> 125,718,255 cpu_core/UOPS_ISSUED.ANY/ (19.24%)
> 23,178,652 cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (19.65%)
> 0 cpu_core/INT_VEC_RETIRED.ADD_256/ (20.05%)
> 1,119,514 cpu_core/DSB2MITE_SWITCHES.PENALTY_CYCLES/ # 0.5 % tma_dsb_switches (20.46%)
> 27,684,795 cpu_core/MEMORY_ACTIVITY.STALLS_L1D_MISS/ # 10.6 % tma_l1_bound
> # 0.7 % tma_l2_bound (20.86%)
> 108,813,079 cpu_core/UOPS_EXECUTED.THREAD/ (21.27%)
> 16,563,036 cpu_core/IDQ.MITE_CYCLES_ANY/ # 5.2 % tma_mite (19.14%)
> 53,037,471 cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (19.14%)
> 41,005,510 cpu_core/UOPS_RETIRED.MS/ (19.14%)
> 575,534 cpu_core/ARITH.DIV_ACTIVE/ # 0.2 % tma_divider (19.14%)
> 0 cpu_core/FP_ARITH_INST_RETIRED.SCALAR_SINGLE,umask=0x03/ (19.14%)
> 2,207,021 cpu_core/EXE_ACTIVITY.BOUND_ON_STORES/ # 0.9 % tma_store_bound (19.13%)
> 5,685,032 cpu_core/UOPS_RETIRED.MS,cmask=1,edge/ (19.13%)
> 25,523 cpu_core/DECODE.LCP/ # 0.0 % tma_lcp (19.12%)
> 26,095,298 cpu_core/MEMORY_ACTIVITY.STALLS_L2_MISS/ # 10.8 % tma_l3_bound (19.13%)
> 108,516 cpu_core/MEMORY_ACTIVITY.STALLS_L3_MISS/ # 0.0 % tma_dram_bound (19.13%)
> 192,239,590 cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (19.12%)
> 5,978 cpu_core/LSD.CYCLES_ACTIVE/ # -0.0 % tma_lsd (19.12%)
> 0 cpu_core/INT_VEC_RETIRED.VNNI_128/ (19.13%)
> 137,530,949 cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/ # 0.1 % tma_dsb (19.12%)
> 240,070,549 cpu_core/CPU_CLK_UNHALTED.THREAD/ # 17.5 % tma_icache_misses
> # 6.1 % tma_itlb_misses
> # 40.3 % tma_branch_resteers (21.52%)
> 0 cpu_core/FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE,umask=0x3c/ (21.51%)
> 595,051 cpu_core/ARITH.DIV_ACTIVE/ (21.52%)
> 461,041 cpu_core/IDQ.DSB_CYCLES_ANY/ (21.51%)
> 0 cpu_core/INT_VEC_RETIRED.MUL_256/ (21.52%)
> 0 cpu_core/UOPS_EXECUTED.X87/ (21.52%)
> 237,196 cpu_core/IDQ.DSB_CYCLES_OK/ (21.52%)
> 125,009 cpu_core/LSD.CYCLES_OK/ (21.52%)
> 0 cpu_core/INT_VEC_RETIRED.ADD_128/ (21.40%)
> 28,388,778 cpu_core/MEM_UOP_RETIRED.ANY/ (18.61%)
> 1,806,629 cpu_core/INST_RETIRED.NOP/ (18.21%)
> 41,928,018 cpu_core/ICACHE_DATA.STALLS/ (17.81%)
> 0 cpu_core/INT_VEC_RETIRED.VNNI_256/ (17.41%)
> 18,230,137 cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (17.02%)
> 28,052,001 cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (16.61%)
> 4,073,568 cpu_core/INST_RETIRED.MACRO_FUSED/ (16.20%)
> 66,509,871 cpu_core/INT_MISC.UNKNOWN_BRANCH_CYCLES/ (15.92%)
> 2,307,447 cpu_core/IDQ.MITE_CYCLES_OK/ (15.91%)
> 30,345,769 cpu_core/INT_MISC.CLEAR_RESTEER_CYCLES/ (15.91%)
> 0 cpu_core/INT_VEC_RETIRED.SHUFFLES/ (15.91%)
> 14,722,079 cpu_core/ICACHE_TAG.STALLS/ (15.90%)
>
> 1.004474469 seconds time elapsed
>
> $ perf stat --metric-no-threshold --metric-no-group -M TopdownL4 -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 1,004,834,399 ns duration_time # 0.3 % tma_false_sharing
> # 40.2 % tma_l3_hit_latency
> # 4.4 % tma_contested_accesses
> # 1.6 % tma_data_sharing
> 3,762,410 cpu_atom/LD_HEAD.PGWALK_AT_RET/ # 3.1 % tma_stlb_miss (33.58%)
> 10 cpu_atom/MACHINE_CLEARS.SMC/ # 0.0 % tma_smc (33.98%)
> 66,500,689 cpu_atom/TOPDOWN_BE_BOUND.MEM_SCHEDULER/ # 0.0 % tma_ld_buffer
> # 0.0 % tma_rsv
> # 11.0 % tma_st_buffer (29.60%)
> 1,051,312 cpu_atom/LD_HEAD.OTHER_AT_RET/ # 0.9 % tma_other_l1 (30.00%)
> 14,740,093 cpu_atom/UOPS_RETIRED.MS/ (30.39%)
> 117,899 cpu_atom/LD_HEAD.DTLB_MISS_AT_RET/ # 0.1 % tma_stlb_hit (30.79%)
> 701,548 cpu_atom/TOPDOWN_BAD_SPECULATION.NUKE/ # 0.0 % tma_disambiguation
> # 0.0 % tma_fp_assist
> # 0.1 % tma_memory_ordering
> # 0.0 % tma_page_fault (31.08%)
> 12,873 cpu_atom/MACHINE_CLEARS.MEMORY_ORDERING/ (31.07%)
> 58,321 cpu_atom/MEM_SCHEDULER_BLOCK.LD_BUF/ (31.07%)
> 43,458 cpu_atom/MEM_SCHEDULER_BLOCK.RSV/ (31.07%)
> 14,256,005 cpu_atom/MEM_SCHEDULER_BLOCK.ALL/ (31.06%)
> 122,156,534 cpu_atom/CPU_CLK_UNHALTED.CORE/ # 0.0 % tma_store_fwd_blk (36.16%)
> 0 cpu_atom/MACHINE_CLEARS.FP_ASSIST/ (35.76%)
> 13,804 cpu_atom/MACHINE_CLEARS.SLOW/ (35.35%)
> 14,388,300 cpu_atom/MEM_SCHEDULER_BLOCK.ST_BUF/ (34.95%)
> 493,070,443 cpu_atom/CPU_CLK_UNHALTED.REF_TSC/ (39.73%)
> 2 cpu_atom/MACHINE_CLEARS.PAGE_FAULT/ (39.33%)
> 1,101 cpu_atom/LD_HEAD.ST_ADDR_AT_RET/ (38.93%)
> 929 cpu_atom/MACHINE_CLEARS.DISAMBIGUATION/ (38.55%)
> 14,241,213 cpu_atom/MEM_SCHEDULER_BLOCK.ALL/ (33.45%)
> 1,010,981,054 cpu_core/TOPDOWN.SLOTS/ # 0.0 % tma_assists
> # 4.3 % tma_cisc
> # 0.0 % tma_fp_scalar
> # 0.0 % tma_fp_vector
> # 0.0 % tma_shuffles
> # 0.0 % tma_int_vector_128b
> # 0.0 % tma_x87_use
> # 0.0 % tma_int_vector_256b
> # 0.7 % tma_clears_resteers
> # 12.4 % tma_mispredicts_resteers (8.14%)
> 132,375,316 cpu_core/topdown-retiring/ (8.14%)
> 88,303,327 cpu_core/topdown-bad-spec/ (8.14%)
> 85,519,216 cpu_core/topdown-br-mispredict/ (8.14%)
> 495,722,455 cpu_core/topdown-fe-bound/ (8.14%)
> 298,147,134 cpu_core/topdown-be-bound/ (8.14%)
> 21,418,803 cpu_core/UOPS_EXECUTED.CYCLES_GE_3/ # 8.8 % tma_ports_utilized_3m (10.12%)
> 35,208,716 cpu_core/OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD,cmask=4/ # 14.5 % tma_mem_bandwidth
> # 33.3 % tma_mem_latency (10.52%)
> 17,358 cpu_core/OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM/ (10.91%)
> 55,883,811 cpu_core/RESOURCE_STALLS.SCOREBOARD/ # 24.1 % tma_ports_utilized_0 (12.91%)
> 0 cpu_core/INT_VEC_RETIRED.ADD_256/ (14.89%)
> 139,890 cpu_core/DTLB_STORE_MISSES.STLB_HIT,cmask=1/ # 2.8 % tma_dtlb_store (15.30%)
> 216,886 cpu_core/MEM_INST_RETIRED.LOCK_LOADS/ # 3.8 % tma_store_latency
> # 0.1 % tma_lock_latency (15.71%)
> 115,948,790 cpu_core/UOPS_EXECUTED.THREAD/ (17.69%)
> 52,155,508 cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (15.93%)
> 6 cpu_core/ASSISTS.ANY,umask=0x1B/ (15.93%)
> 87,422,517 cpu_core/CYCLE_ACTIVITY.CYCLES_MEM_ANY/ # 5.2 % tma_dtlb_load (15.81%)
> 37,420,652 cpu_core/MEMORY_ACTIVITY.CYCLES_L1D_MISS/ (15.44%)
> 43,527,357 cpu_core/UOPS_RETIRED.MS/ (15.04%)
> 31,787,227 cpu_core/INT_MISC.CLEAR_RESTEER_CYCLES/ (14.64%)
> 0 cpu_core/FP_ARITH_INST_RETIRED.SCALAR_SINGLE,umask=0x03/ (14.24%)
> 4,899,130 cpu_core/XQ.FULL_CYCLES/ # 2.0 % tma_sq_full (13.84%)
> 1,365 cpu_core/OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM/ (13.44%)
> 23,904,338 cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ # 9.9 % tma_ports_utilized_1 (13.05%)
> 251,479 cpu_core/L2_RQSTS.ALL_RFO/ (12.76%)
> 188,701,010 cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (12.74%)
> 6,909 cpu_core/MEM_INST_RETIRED.SPLIT_STORES/ # 0.0 % tma_split_stores (12.74%)
> 619,775 cpu_core/MEM_LOAD_RETIRED.L1_MISS/ (9.56%)
> 136,716,345 cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/ # 0.9 % tma_decoder0_alone (11.15%)
> 0 cpu_core/INT_VEC_RETIRED.VNNI_128/ (12.74%)
> 605,850 cpu_core/L1D_PEND_MISS.FB_FULL/ # 0.2 % tma_fb_full (12.73%)
> 60,079 cpu_core/MEM_STORE_RETIRED.L2_HIT/ (11.14%)
> 242,508,080 cpu_core/CPU_CLK_UNHALTED.THREAD/ # 4.2 % tma_ports_utilized_2
> # 0.2 % tma_store_fwd_blk
> # 0.0 % tma_streaming_stores
> # 27.5 % tma_unknown_branches
> # 0.0 % tma_split_loads (12.74%)
> 0 cpu_core/FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE,umask=0x3c/ (14.33%)
> 32,573 cpu_core/LD_BLOCKS.STORE_FORWARD/ (12.74%)
> 1,130 cpu_core/OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD/ (12.74%)
> 4,029 cpu_core/MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS/ (9.56%)
> 4,844,548 cpu_core/INST_DECODED.DECODERS,cmask=1/ (9.56%)
> 5,266 cpu_core/MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD/ (6.37%)
> 0 cpu_core/UOPS_EXECUTED.X87/ (7.96%)
> 0 cpu_core/INT_VEC_RETIRED.MUL_256/ (9.56%)
> 2,786,473 cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/ (9.56%)
> 961,614,001 cpu_core/CPU_CLK_UNHALTED.REF_TSC/ (11.15%)
> 2,433,107 cpu_core/INST_DECODED.DECODERS,cmask=2/ (11.15%)
> 0 cpu_core/INT_VEC_RETIRED.ADD_128/ (12.74%)
> 9,058,046 cpu_core/OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO/ (12.74%)
> 6,399,992 cpu_core/MEM_INST_RETIRED.ALL_STORES/ (12.74%)
> 45,519,749 cpu_core/L1D_PEND_MISS.PENDING/ (9.56%)
> 12,200,559 cpu_core/DTLB_LOAD_MISSES.WALK_ACTIVE/ (7.97%)
> 115,944,190 cpu_core/OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD/ (6.37%)
> 0 cpu_core/INT_VEC_RETIRED.VNNI_256/ (7.96%)
> 1,885,278 cpu_core/INT_MISC.UOP_DROPPING/ (9.56%)
> 524,819 cpu_core/MEM_LOAD_RETIRED.FB_HIT/ (9.56%)
> 26,866,872 cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (11.15%)
> 10,265,977 cpu_core/EXE_ACTIVITY.2_PORTS_UTIL/ (12.74%)
> 66,662,934 cpu_core/INT_MISC.UNKNOWN_BRANCH_CYCLES/ (12.74%)
> 0 cpu_core/OCR.STREAMING_WR.ANY_RESPONSE/ (12.74%)
> 12,499 cpu_core/MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD/ (12.74%)
> 0 cpu_core/INT_VEC_RETIRED.SHUFFLES/ (12.74%)
> 47,649 cpu_core/DTLB_LOAD_MISSES.STLB_HIT,cmask=1/ (12.74%)
> 106,424 cpu_core/L2_RQSTS.RFO_HIT/ (12.74%)
> 0 cpu_core/LD_BLOCKS.NO_SR/ (7.97%)
> 1,343,692 cpu_core/MEM_LOAD_COMPLETED.L1_MISS_ANY/ (7.96%)
> 28,517 cpu_core/L1D_PEND_MISS.L2_STALLS/ (6.37%)
> 394,101 cpu_core/MEM_LOAD_RETIRED.L3_HIT/ (6.36%)
> 76,860,165,929 TSC
>
> 1.004834399 seconds time elapsed
>
> $ perf stat --metric-no-threshold --metric-no-group -M TopdownL5 -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 839,538,302 cpu_core/TOPDOWN.SLOTS/ # 0.0 % tma_avx_assists
> # 0.0 % tma_fp_assists
> # 0.0 % tma_page_faults
> # 0.0 % tma_fp_vector_128b
> # 0.0 % tma_fp_vector_256b (32.40%)
> 100,274,045 cpu_core/topdown-retiring/ (32.40%)
> 77,425,642 cpu_core/topdown-bad-spec/ (32.40%)
> 424,563,652 cpu_core/topdown-fe-bound/ (32.40%)
> 245,420,564 cpu_core/topdown-be-bound/ (32.40%)
> 0 cpu_core/FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE/ (32.79%)
> 54,372,921 cpu_core/RESOURCE_STALLS.SCOREBOARD/ # 22.2 % tma_serializing_operation (33.20%)
> 23,018,585 cpu_core/UOPS_DISPATCHED.PORT_6/ # 8.0 % tma_alu_op_utilization (33.61%)
> 17,748,101 cpu_core/UOPS_DISPATCHED.PORT_2_3_10/ # 4.2 % tma_load_op_utilization (34.02%)
> 0 cpu_core/FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE/ (34.43%)
> 7,616,700 cpu_core/UOPS_DISPATCHED.PORT_0/ (34.83%)
> 96,571 cpu_core/DTLB_STORE_MISSES.STLB_HIT,cmask=1/ # 0.6 % tma_store_stlb_hit (35.25%)
> 84,909,672 cpu_core/CYCLE_ACTIVITY.CYCLES_MEM_ANY/ # 0.2 % tma_load_stlb_hit (35.66%)
> 32,935,744 cpu_core/MEMORY_ACTIVITY.CYCLES_L1D_MISS/ (31.95%)
> 16,597,385 cpu_core/UOPS_DISPATCHED.PORT_5_11/ (31.95%)
> 9,452,844 cpu_core/UOPS_DISPATCHED.PORT_1/ (31.94%)
> 2,620,695 cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/ # 1.8 % tma_store_stlb_miss (31.95%)
> 15,699,364 cpu_core/UOPS_DISPATCHED.PORT_7_8/ # 5.7 % tma_store_op_utilization (31.95%)
> 0 cpu_core/FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE/ (31.94%)
> 142,096,670 cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/ (31.95%)
> 244,591,239 cpu_core/CPU_CLK_UNHALTED.THREAD/ # 5.2 % tma_load_stlb_miss
> # 0.0 % tma_mixing_vectors (35.92%)
> 2,728,385 cpu_core/DTLB_STORE_MISSES.WALK_ACTIVE/ (35.66%)
> 0 cpu_core/ASSISTS.SSE_AVX_MIX/ (35.27%)
> 0 cpu_core/FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE/ (34.86%)
> 12,664,768 cpu_core/DTLB_LOAD_MISSES.WALK_ACTIVE/ (34.46%)
> 12,629,733 cpu_core/DTLB_LOAD_MISSES.WALK_ACTIVE/ (34.04%)
> 0 cpu_core/ASSISTS.FP/ (33.63%)
> 12 cpu_core/ASSISTS.PAGE_FAULT/ (33.23%)
> 16,704,699 cpu_core/UOPS_DISPATCHED.PORT_4_9/ (32.81%)
> 48,386 cpu_core/DTLB_LOAD_MISSES.STLB_HIT,cmask=1/ (28.68%)
>
> 1.002806967 seconds time elapsed
>
> $ perf stat --metric-no-threshold --metric-no-group -M TopdownL6 -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 743,684 cpu_core/UOPS_DISPATCHED.PORT_0/ # 4.6 % tma_port_0
> 1,514 cpu_core/MISC2_RETIRED.LFENCE/ # 0.1 % tma_memory_fence
> 22,120 cpu_core/CPU_CLK_UNHALTED.PAUSE/ # 0.1 % tma_slow_pause
> 16,187,637 cpu_core/CPU_CLK_UNHALTED.DISTRIBUTED/ # 4.5 % tma_port_1
> # 12.6 % tma_port_6
> 16,754,672 cpu_core/CPU_CLK_UNHALTED.THREAD/
> 728,805 cpu_core/UOPS_DISPATCHED.PORT_1/
> 2,040,181 cpu_core/UOPS_DISPATCHED.PORT_6/
>
> 1.002727371 seconds time elapsed
> '''
>
> Using --cputype:
> '''
> $ perf stat --cputype=core -M TopdownL1 -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 90,542,172 cpu_core/TOPDOWN.SLOTS/ # 31.3 % tma_backend_bound
> # 7.0 % tma_bad_speculation
> # 54.0 % tma_frontend_bound
> # 7.6 % tma_retiring
> 6,917,885 cpu_core/topdown-retiring/
> 6,242,227 cpu_core/topdown-bad-spec/
> 2,353,956 cpu_core/topdown-heavy-ops/
> 49,034,945 cpu_core/topdown-fe-bound/
> 28,390,484 cpu_core/topdown-be-bound/
> 98,299 cpu_core/INT_MISC.UOP_DROPPING/
>
> 1.002395582 seconds time elapsed
>
> $ perf stat --cputype=atom -M TopdownL1 -a sleep 1
>
> Performance counter stats for 'system wide':
>
> 645,836 cpu_atom/TOPDOWN_RETIRING.ALL/ # 26.4 % tma_bad_speculation
> 2,404,468 cpu_atom/TOPDOWN_FE_BOUND.ALL/ # 38.9 % tma_frontend_bound
> 1,455,604 cpu_atom/TOPDOWN_BE_BOUND.ALL/ # 23.6 % tma_backend_bound
> # 23.6 % tma_backend_bound_aux
> 1,235,109 cpu_atom/CPU_CLK_UNHALTED.CORE/ # 10.4 % tma_retiring
> 642,124 cpu_atom/TOPDOWN_RETIRING.ALL/
> 2,398,892 cpu_atom/TOPDOWN_FE_BOUND.ALL/
> 1,503,157 cpu_atom/TOPDOWN_BE_BOUND.ALL/
>
> 1.002061651 seconds time elapsed
> '''
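
The effect of `--cputype` in the two runs above can be pictured as a filter on the PMU prefix of each event. The following Python sketch is only a model of that behaviour, not the perf implementation: the `CPUTYPE_TO_PMU` table is inferred from the quoted output (`core` giving `cpu_core` events, `atom` giving `cpu_atom` events), and the helper names are hypothetical. Dropping events that carry no PMU prefix is likewise an assumption of this sketch.

```python
# Hypothetical mapping, inferred only from the quoted perf stat output.
CPUTYPE_TO_PMU = {"core": "cpu_core", "atom": "cpu_atom"}


def pmu_of(event):
    """Return the PMU prefix of a 'pmu/name/' event, or None if absent."""
    if event.endswith("/") and event.count("/") >= 2:
        return event.split("/", 1)[0]
    return None


def filter_events(events, cputype=None):
    """Keep only events on the PMU selected by cputype (None keeps all)."""
    if cputype is None:
        return list(events)
    pmu = CPUTYPE_TO_PMU[cputype]
    return [e for e in events if pmu_of(e) == pmu]


if __name__ == "__main__":
    events = [
        "cpu_core/TOPDOWN.SLOTS/",
        "cpu_atom/TOPDOWN_RETIRING.ALL/",
        "cpu_core/topdown-retiring/",
        "cpu_atom/CPU_CLK_UNHALTED.CORE/",
    ]
    print(filter_events(events, "core"))
    print(filter_events(events, "atom"))
```

Since the filter only compares PMU names, the same logic applies to a system with a single core PMU, which fits the cover letter's point that the option can be tested on non-hybrid systems.
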
>
> Ian Rogers (40):
> perf stat: Introduce skippable evsels
> perf vendor events intel: Add alderlake metric constraints
> perf vendor events intel: Add icelake metric constraints
> perf vendor events intel: Add icelakex metric constraints
> perf vendor events intel: Add sapphirerapids metric constraints
> perf vendor events intel: Add tigerlake metric constraints
> perf stat: Avoid segv on counter->name
> perf test: Test more sysfs events
> perf test: Use valid for PMU tests
> perf test: Mask config then test
> perf test: Test more with config_cache
> perf test: Roundtrip name, don't assume 1 event per name
> perf parse-events: Set attr.type to PMU type early
> perf print-events: Avoid unnecessary strlist
> perf parse-events: Avoid scanning PMUs before parsing
> perf test: Validate events with hyphens in
> perf evsel: Modify group pmu name for software events
> perf test: Move x86 hybrid tests to arch/x86
> perf test x86 hybrid: Don't assume evlist order
> perf parse-events: Support PMUs for legacy cache events
> perf parse-events: Wildcard legacy cache events
> perf print-events: Print legacy cache events for each PMU
> perf parse-events: Support wildcards on raw events
> perf parse-events: Remove now unused hybrid logic
> perf parse-events: Minor type safety cleanup
> perf parse-events: Add pmu filter
> perf stat: Make cputype filter generic
> perf test: Add cputype testing to perf stat
> perf test: Fix parse-events tests for >1 core PMU
> perf parse-events: Support hardware events as terms
> perf parse-events: Avoid error when assigning a term
> perf parse-events: Avoid error when assigning a legacy cache term
> perf parse-events: Don't auto merge hybrid wildcard events
> perf parse-events: Don't reorder atom cpu events
> perf metrics: Be PMU specific for referenced metrics.
> perf metric: Json flag to not group events if gathering a metric group
> perf stat: Command line PMU metric filtering
> perf vendor events intel: Correct alderlake metrics
> perf jevents: Don't rewrite metrics across PMUs
> perf metrics: Be PMU specific in event match
>
> tools/perf/arch/x86/include/arch-tests.h | 1 +
> tools/perf/arch/x86/tests/Build | 1 +
> tools/perf/arch/x86/tests/arch-tests.c | 10 +
> tools/perf/arch/x86/tests/hybrid.c | 275 ++++++
> tools/perf/arch/x86/util/evlist.c | 4 +-
> tools/perf/builtin-list.c | 19 +-
> tools/perf/builtin-record.c | 13 +-
> tools/perf/builtin-stat.c | 73 +-
> tools/perf/builtin-top.c | 5 +-
> tools/perf/builtin-trace.c | 5 +-
> .../arch/x86/alderlake/adl-metrics.json | 275 +++---
> .../arch/x86/alderlaken/adln-metrics.json | 20 +-
> .../arch/x86/broadwell/bdw-metrics.json | 12 +
> .../arch/x86/broadwellde/bdwde-metrics.json | 12 +
> .../arch/x86/broadwellx/bdx-metrics.json | 12 +
> .../arch/x86/cascadelakex/clx-metrics.json | 12 +
> .../arch/x86/haswell/hsw-metrics.json | 12 +
> .../arch/x86/haswellx/hsx-metrics.json | 12 +
> .../arch/x86/icelake/icl-metrics.json | 23 +
> .../arch/x86/icelakex/icx-metrics.json | 23 +
> .../arch/x86/ivybridge/ivb-metrics.json | 12 +
> .../arch/x86/ivytown/ivt-metrics.json | 12 +
> .../arch/x86/jaketown/jkt-metrics.json | 12 +
> .../arch/x86/sandybridge/snb-metrics.json | 12 +
> .../arch/x86/sapphirerapids/spr-metrics.json | 23 +
> .../arch/x86/skylake/skl-metrics.json | 12 +
> .../arch/x86/skylakex/skx-metrics.json | 12 +
> .../arch/x86/tigerlake/tgl-metrics.json | 23 +
> tools/perf/pmu-events/jevents.py | 10 +-
> tools/perf/pmu-events/metric.py | 28 +-
> tools/perf/pmu-events/metric_test.py | 6 +-
> tools/perf/pmu-events/pmu-events.h | 2 +
> tools/perf/tests/evsel-roundtrip-name.c | 119 ++-
> tools/perf/tests/parse-events.c | 826 +++++++++---------
> tools/perf/tests/pmu-events.c | 12 +-
> tools/perf/tests/shell/stat.sh | 44 +
> tools/perf/util/Build | 1 -
> tools/perf/util/evlist.h | 1 -
> tools/perf/util/evsel.c | 30 +-
> tools/perf/util/evsel.h | 1 +
> tools/perf/util/metricgroup.c | 111 ++-
> tools/perf/util/metricgroup.h | 3 +-
> tools/perf/util/parse-events-hybrid.c | 214 -----
> tools/perf/util/parse-events-hybrid.h | 25 -
> tools/perf/util/parse-events.c | 646 ++++++--------
> tools/perf/util/parse-events.h | 61 +-
> tools/perf/util/parse-events.l | 108 +--
> tools/perf/util/parse-events.y | 222 ++---
> tools/perf/util/pmu-hybrid.c | 20 -
> tools/perf/util/pmu-hybrid.h | 1 -
> tools/perf/util/pmu.c | 16 +-
> tools/perf/util/pmu.h | 3 +
> tools/perf/util/pmus.c | 25 +-
> tools/perf/util/pmus.h | 3 +
> tools/perf/util/print-events.c | 85 +-
> tools/perf/util/stat-display.c | 6 +-
> 56 files changed, 1939 insertions(+), 1627 deletions(-)
> create mode 100644 tools/perf/arch/x86/tests/hybrid.c
> delete mode 100644 tools/perf/util/parse-events-hybrid.c
> delete mode 100644 tools/perf/util/parse-events-hybrid.h
>
> --
> 2.40.1.495.gc816e09b53d-goog
>

--

- Arnaldo