Re: [RFC PATCH 0/2] A mechanism for efficient support for per-function metrics

From: Ben Gainey
Date: Wed Feb 14 2024 - 14:14:19 EST


Hi Andi

Thanks for commenting...

On Wed, 2024-02-14 at 01:55 -0800, Andi Kleen wrote:
> Ben Gainey <ben.gainey@xxxxxxx> writes:
>
> > I've been working on an approach to supporting per-function metrics
> > for
> > aarch64 cores, which requires some changes to the arm_pmuv3 driver,
> > and
> > I'm wondering if this approach would make sense as a generic
> > feature
> > that could be used to enable the same on other architectures?
> >
> > The basic idea is as follows:
> >
> >  * Periodically sample one or more counters as needed for the
> > chosen
> >    set of metrics.
> >  * Record a sample count for each symbol so as to identify hot
> >    functions.
> >  * Accumulate counter totals for each of the counters in each of
> > the
> >    metrics *but* only do this where the previous sample's symbol
> >    matches the current sample's symbol.
>
> It sounds very similar to what perf script -F +metric already does
> (or did if it wasn't broken currently). It would be a straight
> forward
> extension here to add this "same as previous" check.


Nice, I wasn't aware of this feature. I'll have a play...


>
> Of course the feature is somewhat dubious in that it will have a very
> strong systematic bias against short functions and even long
> functions
> in some alternating execution patterns. I assume you did some
> experiments to characterize this. It would be important
> to emphasize this in any documentation.

The way I have been thinking about this is that for each sample you
always maintain a periodic sample count so that the relative ranking of
functions is maintained, and that the "same as previous" check is a way
to enhance the attributability of the PMU data for any given sample.

But it absolutely correct to say that this will bias the availability
of PMU data in the way you have describe. The bias depends on sample
window size, workload characteristics and so on.

It should be possible to provide a per metric "valid sample" count that
can be used to judge the "quality" of the metrics for each symbol,
which may allow the user to make some adjustments to the recording
paramters (modify sample period, or sample window size for example).

I'll have a think about the best way to convey this in docs. I have a
few ideas for ways to further impove the attributability / identify
cases where short functions are missed, but they'd not affect the
implementation of anything in the kernel, just perhaps the tool's post-
processing.


>
> > For this to work efficiently, it is useful to provide a means to
> > decouple the sample window (time over which events are counted)
> > from
> > the sample period (time between interesting samples). This
> > patcheset
> > modifies the Arm PMU driver to support alternating between two
> > sample_period values, providing a simple and inexpensive way for
> > tools
> > to separate out the sample period and the sample window. It is
> > expected
> > to be used with the cycle counter event, alternating between a long
> > and
> > short period and subsequently discarding the counter data for
> > samples
> > with the long period. The combined long and short period gives the
> > overall sampling period, and the short sample period gives the
> > sample
> > window. The symbol taken from the sample at the end of the long
> > period
> > can be used by tools to ensure correct attribution as described
> > previously. The cycle counter is recommended as it provides fair
> > temporal distribution of samples as would be required for the
> > per-symbol sample count mentioned previously, and because the PMU
> > can
> > be programmed to overflow after a sufficiently short window; this
> > may
> > not be possible with software timer (for example). This patch does
> > not
> > restrict to only the cycle counter, it is possible there could be
> > other
> > novel uses based on different events.
>
> I don't see anything ARM specific with the technique, so if it's done
> it should be done generically IMHO


Great. When i was originally thinking about the implementation of the
event strobing feature I was thinking:

* Add `strobe_sample` flag bit to opt into the fature
- This will be mutually exclusive with `freq`.
* Add `strobe_period` field to hold the alternate sample period (for
the sample window.
* Have all PMU drivers check and reject the `strobe_sample` flag by
default; the swizzling of the period will be done in the PMU driver its
self if it make sense to support this feature for a given PMU.
- Do you think this is sensible, or would be better handled in core?

Any obvious issues with this approach?

>
>
> > Cursory testing on a Xeon(R) W-2145 sampling every 300 cycles
> > (without
> > the patch) suggests this approach would work for some counters.
> > Calculating branch miss rates for example appears to be correct,
> > likewise UOPS_EXECUTED.THREAD seems to give something like a
> > sensible
> > cycles-per-uop value. On the other hand the fixed function
> > instructions
> > counter does not appear to sample correctly (it seems to report
> > either
> > very small or very large numbers). No idea whats going on there, so
> > any
> > insight welcome...
>
> If you use precise samples with 3p there is a restriction on the
> periods
> that is enforced by the kernel. Non precise or single/double p should
> support arbitrary, except that any p is always period + 1.

Is there some default value for precise? when testing I didn't set any
specific value for p modifier.


>
> One drawback of the technique on x86 is that it won't allow multi
> record
> pebs (collecting samples without interrupts), so the overhead might
> be intrinsically higher.
>
> -Andi


Sure, I think this kind of detail was why I was thinking it should be
the PMU driver rather than core that handles the strobing feature,
since there may be other considerations / better ways to collect the
metrics samples.

Regards
Ben