[RFC PATCH 0/2] A mechanism for efficient support for per-function metrics

From: Ben Gainey
Date: Tue Jan 23 2024 - 06:34:52 EST


I've been working on an approach to supporting per-function metrics for
aarch64 cores, which requires some changes to the arm_pmuv3 driver, and
I'm wondering if this approach would make sense as a generic feature
that could be used to enable the same on other architectures?

The basic idea is as follows:

* Periodically sample one or more counters as needed for the chosen
set of metrics.
* Record a sample count for each symbol so as to identify hot
functions.
* Accumulate counter totals for each of the counters in each of the
metrics *but* only do this where the previous sample's symbol
matches the current sample's symbol.

Discarding the counter deltas when the symbols do not match is
important to ensure that counters are correctly attributed to a single
function; without this step the resulting metrics trend towards some
average value across the top 'n' functions. It is acknowledged that
this heuristic can fail, for example if two samples land on either side
of a nested call, so a sufficiently small sample window (the time over
which the counters are collected) must be used to reduce the risk of
misattribution.
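
To make the heuristic concrete, here is a minimal sketch of the
accumulation step; the sample layout and the names are illustrative
only and are not taken from the actual post-processing script:

from collections import defaultdict

def accumulate(samples):
    # Accumulate per-symbol counter totals, but only keep the deltas
    # for a sample when its symbol matches the previous sample's
    # symbol. 'samples' is an iterable of (symbol, {event: delta})
    # pairs; both the shape and the names are illustrative only.
    hits = defaultdict(int)                         # per-symbol sample counts
    totals = defaultdict(lambda: defaultdict(int))  # per-symbol counter totals
    prev_symbol = None

    for symbol, deltas in samples:
        hits[symbol] += 1
        if symbol == prev_symbol:
            # The whole window was (probably) spent in this function,
            # so the counts can be attributed to it.
            for event, value in deltas.items():
                totals[symbol][event] += value
        # Otherwise the deltas are discarded: the window may span two
        # functions and would skew the metrics towards an average.
        prev_symbol = symbol

    return hits, totals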

So far, this can be achieved without any further modifications to perf
tools or the kernel. However as noted it requires the counter
collection window to be sufficiently small; in testing on
Neoverse-N1/-V1, a window of about 300 cycles gives good results. Using
the cycle counter with a sample_period of 300 is possible, but such an
approach generates significant amounts of capture data, and introduces
a lot of overhead and probe effect. Whilst the kernel will throttle
such a configuration, which helps to mitigate a large portion of the
bandwidth and capture overhead, it is not something that can be
controlled on a per-event basis, or for non-root users, and because
throttling is specified as a percentage of CPU time, its effects vary
from machine to machine.

For this to work efficiently, it is useful to provide a means to
decouple the sample window (time over which events are counted) from
the sample period (time between interesting samples). This patchset
modifies the Arm PMU driver to support alternating between two
sample_period values, providing a simple and inexpensive way for tools
to separate out the sample period and the sample window. It is expected
to be used with the cycle counter event, alternating between a long and
short period and subsequently discarding the counter data for samples
with the long period. The combined long and short period gives the
overall sampling period, and the short sample period gives the sample
window. The symbol taken from the sample at the end of the long period
can be used by tools to ensure correct attribution as described
previously. The cycle counter is recommended as it provides fair
temporal distribution of samples as would be required for the
per-symbol sample count mentioned previously, and because the PMU can
be programmed to overflow after a sufficiently short window; this may
not be possible with a software timer, for example. This patch does not
restrict the mechanism to the cycle counter; there may be other novel
uses based on different events.
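
As a minimal sketch of how a tool might split the overall period and
classify the resulting samples (the period values match the testing
described below; the names are illustrative, and the actual encoding of
the two periods is defined by the config2 bits added in patch 2):

# Illustrative only: how a tool might derive the two alternating
# periods. The real encoding of the strobing behaviour lives in the
# arm_pmuv3 config2 bits added by patch 2; the names below are not
# taken from the patch.
OVERALL_PERIOD = 1_000_000   # cycles between interesting samples
SAMPLE_WINDOW = 300          # cycles over which the counters are counted

long_period = OVERALL_PERIOD - SAMPLE_WINDOW   # 999700 cycles
short_period = SAMPLE_WINDOW                   # 300 cycles

def sample_is_window(sample_period):
    # Samples taken with the long period only provide the "previous
    # symbol" used for attribution; their counter data is discarded.
    # Samples taken with the short period cover the sample window and
    # feed the metric accumulation.
    return sample_period == short_period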

To test this I have developed a simple `perf script`-based Python
script. For a limited set of Arm PMU events it will post-process a
`perf record` capture and generate a table of metrics. Alongside this I
have developed a benchmark application that rotates through a sequence
of different classes of behaviour that can be detected by the Arm PMU
(e.g. mispredicts, cache misses, different instruction mixes). The path
through the benchmark can be rotated after each iteration to ensure the
results don't land on some lucky harmonic with the sample period. The
script can be used with and without the strobing patch, allowing the
results to be compared. Testing was done on Juno (A53+A57), N1SDP, and
Graviton 2 and 3. In addition, this approach has been applied to a few
of Arm's tools projects and has correctly identified improvements and
regressions.
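
For reference, the skeleton of such a `perf script` handler looks
roughly like the following. This is a minimal sketch rather than the
actual script, and the param_dict keys used here are assumptions about
what the perf Python scripting interface exposes on a given perf
version:

# Minimal sketch of a `perf script -s <script>.py` handler; not the
# actual post-processing script. The param_dict keys used ("symbol",
# and the period inside "sample") are assumptions and may need
# adjusting for a given perf version.
from collections import defaultdict

SHORT_PERIOD = 300

hits = defaultdict(int)
prev_symbol = None

def process_event(param_dict):
    global prev_symbol
    symbol = param_dict.get("symbol", "[unknown]")
    period = param_dict["sample"]["period"]

    if period != SHORT_PERIOD:
        # Long-period sample: only remember where execution was; the
        # counter data for this sample is discarded.
        prev_symbol = symbol
        return

    hits[symbol] += 1
    if symbol == prev_symbol:
        # This is where the real script accumulates the counter deltas
        # for the chosen metrics (see the accumulation sketch above).
        pass
    prev_symbol = symbol

def trace_end():
    for symbol, count in sorted(hits.items(), key=lambda kv: -kv[1]):
        print(f"{count:8d} {symbol}")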

Headline results from testing indicate that ~300 cycles sample window
gives good results with or without the strobing patch. When the
strobing patch is used, the resulting `perf.data` files are typically
25-50x smaller than without, and take ~25x less time for the Python
script to post-process. Without strobing, the test application's
runtime was 20x slower when sampling every 300 cycles compared to every
1000000 cycles. With strobing enabled such that the long period was
999700 cycles and the short period was 300 cycles, the test
application's runtime was only 1.2x slower than when sampling every
1000000 cycles. Notably, without the patch, L1D cache miss rates are
significantly higher than with the patch, which we attribute to the
increased cache pressure of trapping into the kernel every 300 cycles.
These results were collected with `perf_cpu_time_max_percent=25`. When
tested with `perf_cpu_time_max_percent=100` the differences in size and
post-processing time are even more pronounced. Disabling throttling did
not lead to obvious improvements in the collected metrics, suggesting
that the sampling approach is sufficient to collect representative
metrics.

Cursory testing on a Xeon(R) W-2145 sampling every 300 cycles (without
the patch) suggests this approach would work for some counters.
Calculating branch miss rates, for example, appears to be correct;
likewise UOPS_EXECUTED.THREAD seems to give something like a sensible
cycles-per-uop value. On the other hand, the fixed-function
instructions counter does not appear to sample correctly (it seems to
report either very small or very large numbers). I have no idea what's
going on there, so any insight is welcome...

This patch modifies the arm_pmu driver and introduces an additional
field in config2 to configure the behaviour. If we think there is broad
applicability, would it make sense to move this into a perf_event_attr
flag or field, and possibly pull it up into core? If we don't think so,
then some feedback on the core header and arm_pmu modifications would
be appreciated.

A copy of the post-processing script is available at
https://github.com/ARM-software/gator/blob/prototypes/strobing/prototypes/strobing-patches/test-script/generate-function-metrics.py
Note that the script itself has a dependency on
https://lore.kernel.org/linux-perf-users/20240123103137.1890779-1-ben.gainey@xxxxxxx/


Ben Gainey (2):
arm_pmu: Allow the PMU to alternate between two sample_period values.
arm_pmuv3: Add config bits for sample period strobing

drivers/perf/arm_pmu.c | 74 +++++++++++++++++++++++++++++-------
drivers/perf/arm_pmuv3.c | 25 ++++++++++++
include/linux/perf/arm_pmu.h | 1 +
include/linux/perf_event.h | 10 ++++-
4 files changed, 95 insertions(+), 15 deletions(-)

--
2.43.0