Re: [PATCH v2] perf stat: Introduce skippable evsels

From: Liang, Kan
Date: Wed Apr 19 2023 - 10:17:25 EST




On 2023-04-19 9:19 a.m., Ian Rogers wrote:
> On Wed, Apr 19, 2023 at 5:31 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>
>>
>>
>> On 2023-04-18 9:00 p.m., Ian Rogers wrote:
>>> On Tue, Apr 18, 2023 at 5:12 PM Ian Rogers <irogers@xxxxxxxxxx> wrote:
>>>>
>>>> On Tue, Apr 18, 2023 at 2:51 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2023-04-18 4:08 p.m., Ian Rogers wrote:
>>>>>> On Tue, Apr 18, 2023 at 11:19 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2023-04-18 11:43 a.m., Ian Rogers wrote:
>>>>>>>> On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
>>>>>>>>>> The json TopdownL1 is enabled if present unconditionally for perf stat
>>>>>>>>>> default. Enabling it on Skylake has multiplexing as TopdownL1 on
>>>>>>>>>> Skylake has multiplexing unrelated to this change - at least on the
>>>>>>>>>> machine I was testing on. We can remove the metric group TopdownL1 on
>>>>>>>>>> Skylake so that we don't enable it by default, there is still the
>>>>>>>>>> group TmaL1. To me, disabling TopdownL1 seems less desirable than
>>>>>>>>>> running with multiplexing - previously to get into topdown analysis
>>>>>>>>>> there has to be knowledge that "perf stat -M TopdownL1" is the way to
>>>>>>>>>> do this.
>>>>>>>>>
>>>>>>>>> To be honest, I don't think it's a good idea to remove the TopdownL1. We
>>>>>>>>> cannot remove it just because the new way cannot handle it. The perf
>>>>>>>>> stat default works well until 6.3-rc7. It's a regression issue of the
>>>>>>>>> current perf-tools-next.
>>>>>>>>
>>>>>>>> I'm not so clear it is a regression to consistently add TopdownL1 for
>>>>>>>> all architectures supporting it.
>>>>>>>
>>>>>>>
>>>>>>> Breaking the perf stat default is a regression.
>>>>>>
>>>>>> Breaking is overstating the use of multiplexing. The impact is less
>>>>>> accuracy in the IPC and branch misses default metrics,
>>>>>
>>>>> Inaccuracy is a breakage for the default.
>>>>
>>>> Can you present a case where this matters? The events are already not
>>>> grouped and so inaccurate for metrics.
>>>
>>> Removing CPUs without perf metrics from the TopdownL1 metric group is
>>> implemented here:
>>> https://lore.kernel.org/lkml/20230419005423.343862-6-irogers@xxxxxxxxxx/
>>> Note, this applies to pre-Icelake and atom CPUs as these also lack
>>> perf metric (aka topdown) events.
>>>
>>
>> That may give the end user the impression that the pre-Icelake doesn't
>> support the Topdown Level1 events, which is not true.
>>
>> I think perf should either keep it for all Intel platforms which
>> supports tma_L1_group, or remove the TopdownL1 name entirely for Intel
>> platform (let the end user use the tma_L1_group and the name exposed by
>> the kernel as before.).
>
> How does this work on hybrid systems? We will enable TopdownL1 because
> of the presence of perf metric (aka topdown) events but this will also
> enable TopdownL1 on the atom core.


This is the output from a hybrid system with current 6.3-rc7.

As you can see, Topdown L1 and L2 are displayed for the big
core. No Topdown events are displayed for the atom core.

(BTW: The 99.15% is not multiplexing. I think it's because perf stat
may start on the big core, and it takes a little time before
something runs on the small core.)


$perf stat ./hybrid_triad_loop.sh

Performance counter stats for './hybrid_triad_loop.sh':

      211.80 msec  task-clock                      #   0.996 CPUs utilized
                5  context-switches                #  23.608 /sec
                3  cpu-migrations                  #  14.165 /sec
              652  page-faults                     #   3.078 K/sec
      411,470,713  cpu_core/cycles/                #   1.943 G/sec
      607,566,483  cpu_atom/cycles/                #   2.869 G/sec  (99.15%)
    1,613,379,362  cpu_core/instructions/          #   7.618 G/sec
    1,616,816,312  cpu_atom/instructions/          #   7.634 G/sec  (99.15%)
      202,876,952  cpu_core/branches/              # 957.884 M/sec
      202,367,829  cpu_atom/branches/              # 955.480 M/sec  (99.15%)
           56,740  cpu_core/branch-misses/         # 267.898 K/sec
           19,033  cpu_atom/branch-misses/         #  89.864 K/sec  (99.15%)
    2,468,765,562  cpu_core/slots/                 #  11.656 G/sec
    1,411,184,398  cpu_core/topdown-retiring/      #   57.4% Retiring
        4,671,159  cpu_core/topdown-bad-spec/      #    0.2% Bad Speculation
       92,222,378  cpu_core/topdown-fe-bound/      #    3.7% Frontend Bound
      952,516,107  cpu_core/topdown-be-bound/      #   38.7% Backend Bound
        2,696,347  cpu_core/topdown-heavy-ops/     #    0.1% Heavy Operations  #   57.2% Light Operations
        4,460,659  cpu_core/topdown-br-mispredict/ #    0.2% Branch Mispredict  #    0.0% Machine Clears
       19,538,486  cpu_core/topdown-fetch-lat/     #    0.8% Fetch Latency  #    3.0% Fetch Bandwidth
       24,170,592  cpu_core/topdown-mem-bound/     #    1.0% Memory Bound  #   37.7% Core Bound

0.212598999 seconds time elapsed

0.212525000 seconds user
0.000000000 seconds sys
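
For reference, the two default metrics under discussion (IPC and branch
miss rate) can be recomputed from the raw counts above. This is only a
small sketch: the counts are copied from this output, the function names
are made up for illustration, and the cpu_atom counts are the values as
printed (with the 99.15% annotation), so treat those as estimates.

```python
# Counts copied from the perf stat output above (6.3-rc7, hybrid machine).
counts = {
    "cpu_core": {
        "cycles": 411_470_713,
        "instructions": 1_613_379_362,
        "branches": 202_876_952,
        "branch-misses": 56_740,
    },
    "cpu_atom": {
        "cycles": 607_566_483,
        "instructions": 1_616_816_312,
        "branches": 202_367_829,
        "branch-misses": 19_033,
    },
}


def ipc(c):
    """Instructions per cycle for one PMU (hypothetical helper)."""
    return c["instructions"] / c["cycles"]


def branch_miss_pct(c):
    """Branch misses as a percentage of all branches (hypothetical helper)."""
    return 100.0 * c["branch-misses"] / c["branches"]


for pmu, c in counts.items():
    print(f"{pmu}: IPC {ipc(c):.2f}, branch-miss rate {branch_miss_pct(c):.3f}%")
```

Both metrics are ratios of two separately counted events, which is why
they are sensitive to whether the events were counted over the same
time window.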


>
>>
>>> With that change I don't have a case that requires skippable evsels,
>>> and so we can take that series with patch 6 over the v1 of that series
>>> with this change.
>>>
>>
>> I'm afraid this is not the only problem the commit 94b1a603fca7 ("perf
>> stat: Add TopdownL1 metric as a default if present") in the
>> perf-tools-next branch introduced.
>>
>> The topdown L2 in the perf stat default on SPR and big core of the ADL
>> is still missed. I don't see a possible fix for this on the current
>> perf-tools-next branch.
>
> I thought in its current state the json metrics for TopdownL2 on SPR
> have multiplexing. Given L1 is used to drill down to L2, it seems odd
> to start on L2, but given L1 is used to compute the thresholds for L2,
> this should be to have both L1 and L2 on these platforms. However,
> that doesn't work as you don't want multiplexing.
>
> This all seems backward to avoid potential multiplexing on branch miss
> rate and IPC, just always having TopdownL1 seems cleanest with the
> skippable evsels working around the permissions issue - as put forward
> in this patch. Possibly adding L2 metrics on ADL/SPR, but only once
> the multiplexing issue is resolved.
>

No, not just that issue. Based on what I have tested over the past few
days, the perf stat default has issues/regressions on most Intel
platforms with the current perf-tools-next and perf/core branches of
acme's repo.

For the pre-ICL platforms:
- The permission issue. (This patch tries to address it.)
- Unclean perf stat default. (This patch fails to address it.)
  Unnecessary multiplexing for cycles.
  Only part of TopdownL1 is displayed.

https://lore.kernel.org/lkml/d1fe801a-22d0-1f9b-b127-227b21635bd5@xxxxxxxxxxxxxxx/

For SPR platforms:
- Topdown L2 metrics are missing, while they work with the current 6.3-rc7.

For ADL/RPL platforms:
- Segmentation fault, which I just found this morning.
# ./perf stat true
Segmentation fault (core dumped)
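
To illustrate why the unnecessary multiplexing in the pre-ICL item above
costs accuracy: as far as I understand it, when an event is only
scheduled for part of the time it was enabled, the reported count is
extrapolated as raw * time_enabled / time_running, and the percentage in
parentheses is time_running / time_enabled. A minimal sketch follows,
with a hypothetical function name and made-up numbers that are not taken
from any run in this thread.

```python
def scale_count(raw, time_enabled, time_running):
    """Extrapolate a multiplexed count to the full enabled window.

    Returns (scaled_count, running_pct). The scaled value is an estimate:
    it assumes the event rate while the counter was not scheduled matches
    the rate while it was scheduled.
    """
    if time_running == 0:
        # The event never got onto a counter, so there is nothing to scale.
        return None, 0.0
    running_pct = 100.0 * time_running / time_enabled
    scaled = raw * time_enabled / time_running
    return scaled, running_pct


# Hypothetical values: the event was on a counter for 150 ms out of 200 ms.
scaled, pct = scale_count(raw=1_000_000,
                          time_enabled=200_000_000,
                          time_running=150_000_000)
print(f"{scaled:.0f} ({pct:.2f}%)")
```

If cycles and instructions end up scheduled over different windows, each
is extrapolated independently, so a ratio such as IPC inherits the error
of both estimates.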


After testing on a hybrid machine, I am inclined to revert commit
94b1a603fca7 ("perf stat: Add TopdownL1 metric as a default if present")
and the related patches for now.

To clarify, I do not object to a generic solution for Topdown across
different ARCHs. But the current generic solution, aka TopdownL1, has all
kinds of problems on most Intel platforms. We should fix them first
before applying it to the mainline.

Thanks,
Kan