Re: [PATCH v2] perf stat: Introduce skippable evsels

From: Ian Rogers
Date: Tue Apr 18 2023 - 16:09:22 EST

Next message: Florian Fainelli: "Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks"
Previous message: kernel test robot: "Re: [PATCH v14 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs"
In reply to: Liang, Kan: "Re: [PATCH v2] perf stat: Introduce skippable evsels"
Next in thread: Liang, Kan: "Re: [PATCH v2] perf stat: Introduce skippable evsels"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Apr 18, 2023 at 11:19 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>
>
>
> On 2023-04-18 11:43 a.m., Ian Rogers wrote:
> > On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
> >>> The json TopdownL1 is enabled if present unconditionally for perf stat
> >>> default. Enabling it on Skylake has multiplexing as TopdownL1 on
> >>> Skylake has multiplexing unrelated to this change - at least on the
> >>> machine I was testing on. We can remove the metric group TopdownL1 on
> >>> Skylake so that we don't enable it by default, there is still the
> >>> group TmaL1. To me, disabling TopdownL1 seems less desirable than
> >>> running with multiplexing - previously to get into topdown analysis
> >>> there has to be knowledge that "perf stat -M TopdownL1" is the way to
> >>> do this.
> >>
> >> To be honest, I don't think it's a good idea to remove the TopdownL1. We
> >> cannot remove it just because the new way cannot handle it. The perf
> >> stat default works well until 6.3-rc7. It's a regression issue of the
> >> current perf-tools-next.
> >
> > I'm not so clear it is a regression to consistently add TopdownL1 for
> > all architectures supporting it.
>
>
> Breaking the perf stat default is a regression.

Breaking is overstating the use of multiplexing. The impact is less
accuracy in the IPC and branch misses default metrics, if multiplexing
is necessary on your Intel architecture. I believe TopdownL1 is more
useful than either of these metrics and so having TopdownL1 be a
default is an improvement. We can add a patch, as I keep repeating
this is off-topic for this patch, to make it so that TopdownL1 isn't
enabled on Intel CPUs pre-Icelake, but I would discourage this.

> The reason we once added the TopdownL1 for ICL and later platform is
> that it doesn't break the original design (a *clean* output).

Right, and in 6.3-rc7 the aggregation of counts was broken because of
duplicated counts and hard coded metrics (I did a last minute partial
fix). In perf-tools-next aggregation was fixed and we switched to the
json metrics, that are accurate and up-to-date with the latest TMA
metrics, so that we wouldn't need to maintain a duplicate code path.
What keys enabling TopdownL1 in 6.3 is the presence of topdown events
whilst in perf-tools-next it is the presence of TopdownL1 metric
group, as this is a more consistent approach and had first been
proposed by ARM.

> > The assertion is that because
> > TopdownL1 has multiplexing and multiplexing is global then we've
> > lowered the accuracy of other metrics, but the only other hardware
> > metrics in the default output pre-Icelake are IPC and branch miss
> > rate. Having TopdownL1 is a way of drilling into performance issues,
> > while IPC and branch miss rate are putting your finger in the air to
> > see which way the wind is blowing. Perhaps we should drop these if
> > TopdownL1 is present.
> >
> >> But I'm OK to add some flags with the metrics to assist the perf tool to
> >> specially handle the case if you prefer to modify the event list.
> >
> > We've already removed thresholds from the default output, we could
> > remove groups.
> >
> >>>
> >>> This doesn't relate to this change which is about making it so that
> >>> failing to set up TopdownL1 doesn't cause an early exit. The reason I
> >>> showed TigerLake output was that on TigerLake the skip/fallback
> >>> approach works while Skylake just needs the events disabled/skipped
> >>> unless it has sufficient permissions. Note the :u on the events in:
> >>
> >> The perf_event_open() should be good to detect the insufficient
> >> permission, but it doesn't work to detect an existing of an event.
> >> That's because the kernel only checks the features not specific events.
> >> It's not a reliable way to rely on the output of the perf_event_open() here.
> >
> > I'm unclear why not as not having perf_event_open fail seems like a
> > kernel bug. I can see there is little motivation to fix this on older
> > architectures like Skylake.
>
> Updating kernel may not be an option here. Because
> - There is no issue with the existing perf tool until 6.3-rc7. It
> doesn't sound like a defect of the kernel.
> - The kernel for the SKL has been there for many years. It's impossible
> to change all the kernels to support a new requirement from the perf tool.

This isn't a new behavior in the perf tool, the expectation that
perf_event_open fails is one that is present in the implementation of
weak event groups.

> > We do attempt to workaround it with the
> > errata flags on the metrics introduced here:
> > https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py#L1296
> >
> If so, I would still suggest to check the slots and topdown events in
> sysfs and decide whether to append the TopdownL1 to perf stat default.
>
> So we probably need a ARCH specific is_event_available(), before
> appending the events.

This can be done, but is a feature request. I think we should enable
TopdownL1 by default as the way to avoid multiplexing, if you care or
are affected, is to choose the specific event or metric of interest
and profile with just that. I suspect for branch miss-rate and IPC,
the numbers wanting to do that will be small. Educating that TopdownL1
exists is a harder problem and shouldn't just be something present for
Intel Icelake+ users or those who've read through the wiki.

Thanks,
Ian

> >>>> From your test result in the v2 description, we can see that there is no
> >>>> TopdownL1, which is good and expected. However, there is a (48.99%) with
> >>>> cycles:u event, which implies multiplexing. Could you please check
> >>>> what's the problem here?
> >>>> Also, if it's because of the backgroud, all the events should be
> >>>> multiplexing. But it looks like only cycle:u has multiplexing. Other
> >>>> events, instructions:u, branches:u and branch-misses:u work without
> >>>> multiplexing. That's very strange.
> >>> I wasn't able to reproduce it and suspect it was a transient thing. I
> >>> think there are multiplexing things to look into, 2 events on a fixed
> >>> counter on Icelake+ will trigger multiplexing on all counters, and
> >>> Skylake's 3 fixed and 4 generic should fit TopdownL1.
> >>
> >> Just found a cascade lake. With this patch + the current
> >> perf-tools-next, partial of the TopdownL1 and multiplexing can still be
> >> observed.
> >>
> >> $ sudo ./perf stat true
> >>
> >> Performance counter stats for 'true':
> >>
> >> 2.91 msec task-clock # 0.316 CPUs
> >> utilized
> >> 0 context-switches # 0.000 /sec
> >> 0 cpu-migrations # 0.000 /sec
> >> 45 page-faults # 15.474 K/sec
> >> 2,819,972 cycles # 0.970 GHz
> >> (60.14%)
> >> 5,391,406 instructions # 1.91 insn
> >> per cycle
> >> 1,068,575 branches # 367.442 M/sec
> >> 8,455 branch-misses # 0.79% of
> >> all branches
> >> 70,283 CPU_CLK_UNHALTED.REF_XCLK # 24.168
> >> M/sec
> >> 48,806 INT_MISC.RECOVERY_CYCLES_ANY # 16.783
> >> M/sec (39.86%)
> >>
> >> 0.009204517 seconds time elapsed
> >>
> >> 0.000000000 seconds user
> >> 0.009614000 seconds sys
> >
> > The issue here is that 'true' ran very quickly and so we skipped the
> > output of the events with 0 counts, no metrics were computed due to
> > the zero counts. Cascade lake has the same TopdownL1 multiplexing
> > issues as Skylake.
> >
>
> I tried other benchmark. I can still observe the multiplexing. But my
> remote machine just crashed. I'm asking the tech support. So I cannot do
> more tests.
>
> Thanks,
> Kan

Next message: Florian Fainelli: "Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks"
Previous message: kernel test robot: "Re: [PATCH v14 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs"
In reply to: Liang, Kan: "Re: [PATCH v2] perf stat: Introduce skippable evsels"
Next in thread: Liang, Kan: "Re: [PATCH v2] perf stat: Introduce skippable evsels"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]