Re: [RFC PATCH 0/2] A mechanism for efficient support for per-function metrics

From: Ben Gainey
Date: Sun Mar 10 2024 - 09:00:36 EST


On Wed, 2024-02-14 at 23:08 -0800, Andi Kleen wrote:
> > On Wed, Feb 14, 2024 at 07:13:50PM +0000, Ben Gainey wrote:
> > > >
> > > > Nice, I wasn't aware of this feature. I'll have a play...
> >
> > You have to use an old perf version for now, still need to fix it.
> > > >
> > > >
> > > > > >
> > > > > > Of course the feature is somewhat dubious in that it will
> > > > > > have a > > > very
> > > > > > strong systematic bias against short functions and even
> > > > > > long
> > > > > > functions
> > > > > > in some alternating execution patterns. I assume you did
> > > > > > some
> > > > > > experiments to characterize this. It would be important
> > > > > > to emphasize this in any documentation.
> > > >
> > > > The way I have been thinking about this is that for each sample
> > > > you
> > > > always maintain a periodic sample count so that the relative >
> > > > > ranking of
> > > > functions is maintained, and that the "same as previous" check
> > > > is a > > way
> > > > to enhance the attributability of the PMU data for any given >
> > > > > sample.
> > > >
> > > > But it absolutely correct to say that this will bias the > >
> > > > availability
> > > > of PMU data in the way you have describe. The bias depends on >
> > > > > sample
> > > > window size, workload characteristics and so on.
> >
> > I would be more comfortable with it if you added some randomization
> > on the window sizes. That would limit bias and worst case sampling
> > error.

Agreed. I think for the Arm PMU maybe there is not the same "precise"
behaviour but its certainly something that should be possible to
implement.


> >
> > > > It should be possible to provide a per metric "valid sample"
> > > > count > > that
> > > > can be used to judge the  "quality" of the metrics for each
> > > > symbol,
> > > > which may allow the user to make some adjustments to the
> > > > recording
> > > > paramters (modify sample period, or sample window size for > >
> > > > example).
> >
> > Even that would be misleading because it assumes that the IP stayed
> > in the same function between the two samples. But you could have
> > something like
> >
> > F1  sample
> > F2
> > F1  sample
> >
> > and if you're unlucky this could happen systematically. The
> > randomization would fight it somewhat, but even there you might
> > be very very unlucky.
> >
> > The only sure way to judge it really is to run branch trace in >
> > parallel and see
> > if it is correct.

True. I've been looking at using the function return PMU counter to
identify this case and filtering accordingly. Needs a bit more work but
seems promising.


> >
> > Also there is of course the problem that on a modern core the
> > reordering window might well be larger than your sample window, so
> > > any notion
> > of things happening inside a short window is quite fuzzy.


True. In the absense of an alternative hardware mechanism though, I'm
aiming for representative metrics rather than precise metrics. That is
to say, metrics attributable to functions with a sufficient degree of
accuracy to be meaningful/actionable. In this context the fuzziness is
acceptable.


> >
> > > > > >
> > > > > > I don't see anything ARM specific with the technique, so if
> > > > > > it's > > > done
> > > > > > it should be done generically IMHO
> > > >
> > > >
> > > > Great. When i was originally thinking about the implementation
> > > > of > > the
> > > > event strobing feature I was thinking:
> > > >
> > > >  * Add `strobe_sample` flag bit to opt into the fature
> > > >    - This will be mutually exclusive with `freq`.
> > > >  * Add `strobe_period` field to hold the alternate sample
> > > > period > > (for
> > > > the sample window.
> > > >  * Have all PMU drivers check and reject the `strobe_sample`
> > > > flag > > by
> > > > default; the swizzling of the period will be done in the PMU
> > > > driver > > its
> > > > self if it make sense to support this feature for a given PMU.
> > > >    - Do you think this is sensible, or would be better handled
> > > > in > > core?
> >
> > I would have a common function in core that is called from the PMU
> > drivers, similar to how the adaptive period is done today.

Ack, thanks.

Regards
Ben