Re: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake metric fixes

From: Liang, Kan
Date: Thu Jan 04 2024 - 14:30:16 EST




On 2024-01-04 12:51 p.m., Ian Rogers wrote:
> On Thu, Jan 4, 2024 at 6:30 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>
>>
>>
>> On 2024-01-04 8:56 a.m., Ian Rogers wrote:
>>>> Testing tma_slow_pause
>>>> Metric 'tma_slow_pause' not printed in:
>>>> # Running 'internals/synthesize' benchmark:
>>>> Computing performance of single threaded perf event synthesis by
>>>> synthesizing events on the perf process itself:
>>>> Average synthesis took: 49.987 usec (+- 0.049 usec)
>>>> Average num. events: 47.000 (+- 0.000)
>>>> Average time per event 1.064 usec
>>>> Average data synthesis took: 53.490 usec (+- 0.033 usec)
>>>> Average num. events: 245.000 (+- 0.000)
>>>> Average time per event 0.218 usec
>>>>
>>>> Performance counter stats for 'perf bench internals synthesize':
>>>>
>>>> <not counted> cpu_core/TOPDOWN.SLOTS/ (0.00%)
>>>> <not counted> cpu_core/topdown-retiring/ (0.00%)
>>>> <not counted> cpu_core/topdown-mem-bound/ (0.00%)
>>>> <not counted> cpu_core/topdown-bad-spec/ (0.00%)
>>>> <not counted> cpu_core/topdown-fe-bound/ (0.00%)
>>>> <not counted> cpu_core/topdown-be-bound/ (0.00%)
>>>> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/ (0.00%)
>>>> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/ (0.00%)
>>>> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/ (0.00%)
>>>> <not counted> cpu_core/CPU_CLK_UNHALTED.PAUSE/ (0.00%)
>>>> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/ (0.00%)
>>>> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/ (0.00%)
>>>> <not counted> cpu_core/ARITH.DIV_ACTIVE/ (0.00%)
>>>> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/ (0.00%)
>>>> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (0.00%)
>>>>
>>>> 1.186254766 seconds time elapsed
>>>>
>>>> 0.427220000 seconds user
>>>> 0.752217000 seconds sys
>>>> Testing smi_cycles
>>>> Testing smi_num
>>>> Testing tsx_aborted_cycles
>>>> Testing tsx_cycles_per_elision
>>>> Testing tsx_cycles_per_transaction
>>>> Testing tsx_transactional_cycles
>>>> test child finished with -1
>>>> ---- end ----
>>>> perf all metrics test: FAILED!
>>>> root@number:~#
>>> Have a try disabling the NMI watchdog. Agreed that there is more to
>>> fix here but I think the PMU driver is in part to blame because
>>> manually breaking the weak group of events is a fix.
>>
>> I think we have a NO_GROUP_EVENTS_NMI metric constraint to mark a group
>> which requires disabling the NMI watchdog.
>> Maybe we should mark the group as a NO_GROUP_EVENTS_NMI metric.
>
> +Weilin due to the effects of event grouping.
>
> Thanks Kan, NO_GROUP_EVENTS_NMI would be good. Something I see for
> tma_ports_utilized_1 that may be worsening things is:
>
> ```
> Testing tma_ports_utilized_1
> Metric 'tma_ports_utilized_1' not printed in:
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 49.581 usec (+- 0.030 usec)
> Average num. events: 47.000 (+- 0.000)
> Average time per event 1.055 usec
> Average data synthesis took: 53.367 usec (+- 0.032 usec)
> Average num. events: 246.000 (+- 0.000)
> Average time per event 0.217 usec
>
> Performance counter stats for 'perf bench internals synthesize':
>
> <not counted> cpu_core/TOPDOWN.SLOTS/
> (0.00%)
> <not counted> cpu_core/topdown-retiring/
> (0.00%)
> <not counted> cpu_core/topdown-mem-bound/
> (0.00%)
> <not counted> cpu_core/topdown-bad-spec/
> (0.00%)
> <not counted> cpu_core/topdown-fe-bound/
> (0.00%)
> <not counted> cpu_core/topdown-be-bound/
> (0.00%)
> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
> (0.00%)
> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/
> (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/
> (0.00%)
> <not counted> cpu_core/ARITH.DIV_ACTIVE/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
> (0.00%)
>
> 1.180394056 seconds time elapsed
>
> 0.409881000 seconds user
> 0.764134000 seconds sys
> ```
>
> The event EXE_ACTIVITY.1_PORTS_UTIL is repeated. This is because the
> metric code deduplicates events based purely on their name and so
> doesn't realize EXE_ACTIVITY.1_PORTS_UTIL is the same as
> cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@. This is a hybrid-only glitch as
> we only prefix with a PMU for hybrid metrics, and I should find and
> fix why there's no PMU for the one case of EXE_ACTIVITY.1_PORTS_UTIL.
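
A minimal sketch of the name-only deduplication problem described above. It
is not the perf metric code: the helper names are hypothetical, and treating
a bare event as belonging to cpu_core is an assumption made only for this
illustration.

```python
import re

# Matches the "pmu@event@" form used in metric expressions.
_PMU_FORM = re.compile(r"^(?P<pmu>[A-Za-z0-9_]+)@(?P<event>.+)@$")


def dedup_by_name(events):
    """Deduplicate on the exact string, keeping first occurrences."""
    seen, out = set(), []
    for ev in events:
        if ev not in seen:
            seen.add(ev)
            out.append(ev)
    return out


def canonical(event, default_pmu="cpu_core"):
    """Return a (pmu, event) key; a bare name gets default_pmu (assumption)."""
    m = _PMU_FORM.match(event)
    if m:
        return (m.group("pmu"), m.group("event"))
    return (default_pmu, event)


def dedup_by_pmu_and_event(events, default_pmu="cpu_core"):
    """Deduplicate on the (pmu, event) key instead of the raw string."""
    seen, out = set(), []
    for ev in events:
        key = canonical(ev, default_pmu)
        if key not in seen:
            seen.add(key)
            out.append(ev)
    return out


events = [
    "cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@",
    "EXE_ACTIVITY.1_PORTS_UTIL",
    "cpu_core@ARITH.DIV_ACTIVE@",
]
by_name = dedup_by_name(events)          # keeps both spellings of the event
by_key = dedup_by_pmu_and_event(events)  # collapses them into one entry
```

With string comparison alone the two spellings of EXE_ACTIVITY.1_PORTS_UTIL
both survive, which is how the event ends up in the group twice.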
>
> This problem doesn't occur for tma_slow_pause and I wondered if you
> could give insight. That metric has the counters below:
> ```
> $ perf stat -M tma_slow_pause -a sleep 0.1
>
> Performance counter stats for 'system wide':
>
> <not counted> cpu_core/TOPDOWN.SLOTS/
> (0.00%)
> <not counted> cpu_core/topdown-retiring/
> (0.00%)
> <not counted> cpu_core/topdown-mem-bound/
> (0.00%)
> <not counted> cpu_core/topdown-bad-spec/
> (0.00%)
> <not counted> cpu_core/topdown-fe-bound/
> (0.00%)
> <not counted> cpu_core/topdown-be-bound/
> (0.00%)
> <not counted> cpu_core/RESOURCE_STALLS.SCOREBOARD/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/
> (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.PAUSE/
> (0.00%)
> <not counted> cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/
> (0.00%)
> <not counted> cpu_core/CPU_CLK_UNHALTED.THREAD/
> (0.00%)
> <not counted> cpu_core/ARITH.DIV_ACTIVE/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/
> (0.00%)
> <not counted> cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
> (0.00%)
>
> 0.102074888 seconds time elapsed
> ```
>
> With -vv I see the event string is:
> '{RESOURCE_STALLS.SCOREBOARD/metric-id=RESOURCE_STALLS.SCOREBOARD/,cpu_core/EXE_ACTIVITY.1_PORTS_UTIL,metric-id=cpu_core!3EXE_ACTIVITY.1_PORTS_UTIL!3/,cpu_core/TOPDOWN.SLOTS,metric-id=cpu_core!3TOPDOWN.SLOTS!3/,cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS,metric-id=cpu_core!3EXE_ACTIVITY.BOUND_ON_LOADS!3/,cpu_core/topdown-retiring,metric-id=cpu_core!3topdown!1retiring!3/,cpu_core/topdown-mem-bound,metric-id=cpu_core!3topdown!1mem!1bound!3/,cpu_core/topdown-bad-spec,metric-id=cpu_core!3topdown!1bad!1spec!3/,CPU_CLK_UNHALTED.PAUSE/metric-id=CPU_CLK_UNHALTED.PAUSE/,cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL,metric-id=cpu_core!3CYCLE_ACTIVITY.STALLS_TOTAL!3/,cpu_core/CPU_CLK_UNHALTED.THREAD,metric-id=cpu_core!3CPU_CLK_UNHALTED.THREAD!3/,cpu_core/ARITH.DIV_ACTIVE,metric-id=cpu_core!3ARITH.DIV_ACTIVE!3/,cpu_core/topdown-fe-bound,metric-id=cpu_core!3topdown!1fe!1bound!3/,cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc,metric-id=cpu_core!3EXE_ACTIVITY.2_PORTS_UTIL!0umask!20xc!3/,cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80,metric-id=cpu_core!3EXE_ACTIVITY.3_PORTS_UTIL!0umask!20x80!3/,cpu_core/topdown-be-bound,metric-id=cpu_core!3topdown!1be!1bound!3/}:W'
>
> which without the metric-ids becomes:
> '{RESOURCE_STALLS.SCOREBOARD,cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/,cpu_core/TOPDOWN.SLOTS/,cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/,cpu_core/topdown-retiring/,cpu_core/topdown-mem-bound/,cpu_core/topdown-bad-spec/,CPU_CLK_UNHALTED.PAUSE,cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/,cpu_core/CPU_CLK_UNHALTED.THREAD/,cpu_core/ARITH.DIV_ACTIVE/,cpu_core/topdown-fe-bound/,cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/,cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/,cpu_core/topdown-be-bound/}:W'
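
The reduction from the -vv string to the shorter one can be reproduced with a
small sketch like the one below. The function name is hypothetical, and it
assumes, as in the string above, that an encoded metric-id value never
contains ',' or '/'.

```python
import re


def strip_metric_ids(event_string):
    """Drop metric-id terms from a perf event string (illustrative only)."""
    # Bare events carry the id as "/metric-id=.../": remove the whole suffix.
    s = re.sub(r"/metric-id=[^,/]*/", "", event_string)
    # PMU-prefixed events carry it as one more ",metric-id=..." term.
    return re.sub(r",metric-id=[^,/]*", "", s)


sample = (
    "{RESOURCE_STALLS.SCOREBOARD/metric-id=RESOURCE_STALLS.SCOREBOARD/,"
    "cpu_core/TOPDOWN.SLOTS,metric-id=cpu_core!3TOPDOWN.SLOTS!3/,"
    "cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc,"
    "metric-id=cpu_core!3EXE_ACTIVITY.2_PORTS_UTIL!0umask!20xc!3/}:W"
)
stripped = strip_metric_ids(sample)
```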
>
> I count 9 non-slots/top-down counters there, but I see
> CPU_CLK_UNHALTED.THREAD can use fixed counter 1. Should
> perf_event_open fail for a CPU that has a pinned use of a fixed
> counter and the group needs the fixed counter?

I tried, but the idea was rejected.

> I'm guessing you don't
> want this as CPU_CLK_UNHALTED.THREAD can also go on a generic counter
> and the driver doesn't want to count counter usage; it seems feasible
> to add it though. I guess we need a NO_GROUP_EVENTS_NMI whenever
> CPU_CLK_UNHALTED.THREAD is an event and 8 generic counters are in use.

Yes, it looks good to me.

>
> Checking on Tigerlake I see:
> ```
> $ perf stat -M tma_slow_pause -a sleep 0.1
>
> Performance counter stats for 'system wide':
>
> 105,210,913 TOPDOWN.SLOTS # 0.1 %
> tma_slow_pause (72.65%)
> 6,701,129 topdown-retiring
> (72.65%)
> 52,359,712 topdown-fe-bound
> (72.65%)
> 32,904,532 topdown-be-bound
> (72.65%)
> 14,117,814 topdown-bad-spec
> (72.65%)
> 6,602,391 RESOURCE_STALLS.SCOREBOARD
> (76.17%)
> 4,220,773 cpu/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
> (76.73%)
> 421,812 EXE_ACTIVITY.BOUND_ON_STORES
> (76.69%)
> 5,164,088 EXE_ACTIVITY.1_PORTS_UTIL
> (76.70%)
> 299,681 cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/
> (76.69%)
> 245 MISC_RETIRED.PAUSE_INST
> (76.67%)
> 58,403,687 CPU_CLK_UNHALTED.THREAD
> (76.72%)
> 25,297,841 CYCLE_ACTIVITY.STALLS_MEM_ANY
> (76.67%)
> 3,788,772 EXE_ACTIVITY.2_PORTS_UTIL
> (62.69%)
> 20,973,875 CYCLE_ACTIVITY.STALLS_TOTAL
> (62.16%)
> 68,053 ARITH.DIVIDER_ACTIVE
> (62.18%)
>
> 0.102624327 seconds time elapsed
> ```
> so 10 generic counters which would never fit and the weak group is
> broken - the difference in the metric explaining why I've not been
> seeing the issue. I think I need to add alderlake/sapphirerapids
> constraints here:
> https://github.com/captain5050/perfmon/blob/main/scripts/create_perf_json.py#L1382
> Ideally we'd automate the constraint generation (or the PMU driver
> would help us out by failing to open the weak group).

Yes, automation would be great. The NO_GROUP_EVENTS_NMI constraint can be
set for a group which has CPU_CLK_UNHALTED.THREAD and where the number of
core events (except topdown) == the max number of GP counters + 1.
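
As a rough sketch only (not code from create_perf_json.py), the rule could be
checked like this. The helper names are hypothetical, the number of GP
counters must be supplied by the caller, and the value 8 below is the figure
mentioned earlier in this thread for this group.

```python
def event_name(event):
    """Extract the event name from 'pmu/name,term=.../' or a bare name."""
    body = event.split("/")[1] if "/" in event else event
    return body.split(",")[0]


def is_topdown(event):
    """True for the slots event and the topdown-* events."""
    name = event_name(event).lower()
    return name == "topdown.slots" or name.startswith("topdown-")


def needs_no_group_events_nmi(events, num_gp_counters):
    """Apply the rule above: THREAD present and core events == GP + 1."""
    core = [event_name(e) for e in events if not is_topdown(e)]
    return ("CPU_CLK_UNHALTED.THREAD" in core
            and len(core) == num_gp_counters + 1)


# The tma_slow_pause group from the event string quoted above.
slow_pause = [
    "RESOURCE_STALLS.SCOREBOARD",
    "cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/",
    "cpu_core/TOPDOWN.SLOTS/",
    "cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/",
    "cpu_core/topdown-retiring/",
    "cpu_core/topdown-mem-bound/",
    "cpu_core/topdown-bad-spec/",
    "CPU_CLK_UNHALTED.PAUSE",
    "cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/",
    "cpu_core/CPU_CLK_UNHALTED.THREAD/",
    "cpu_core/ARITH.DIV_ACTIVE/",
    "cpu_core/topdown-fe-bound/",
    "cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/",
    "cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/",
    "cpu_core/topdown-be-bound/",
]
flag = needs_no_group_events_nmi(slow_pause, num_gp_counters=8)
```

For this group there are 9 non-topdown events including
CPU_CLK_UNHALTED.THREAD, so with 8 GP counters the check returns True.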

Thanks,
Kan
>
> Thanks,
> Ian
>
>
>> Thanks,
>> Kan
>>
>>> Fwiw, if we
>>> switch to the buddy watchdog mechanism then we'll no longer need to
>>> disable the NMI watchdog:
>>> https://lore.kernel.org/lkml/20230421155255.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid/
>>
>