Re: [RFC PATCH 0/6] perf: Add AUX data sampling

From: Andi Kleen
Date: Fri Sep 23 2016 - 18:34:35 EST

Next message: Rob Herring: "Re: [PATCH v4] devicetree: bindings: uart: Add new compatible string for ZynqMP"
Previous message: Daniel Borkmann: "Re: [PATCH 2/3] bpf powerpc: implement support for tail calls"
In reply to: Peter Zijlstra: "Re: [RFC PATCH 0/6] perf: Add AUX data sampling"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Sep 23, 2016 at 10:35:27PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 23, 2016 at 10:19:43AM -0700, Andi Kleen wrote:
> > On Fri, Sep 23, 2016 at 01:49:17PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 23, 2016 at 02:27:20PM +0300, Alexander Shishkin wrote:
> > > > Hi Peter,
> > > >
> > > > This is an RFC, I'm not sending the tooling bits in this series,
> > > > although they can be found here [1].
> > > >
> > > > This series introduces AUX data sampling for perf events, which in
> > > > case of our instruction/branch tracing PMUs like Intel PT, BTS, CS
> > > > ETM means execution flow history leading up to a perf event's
> > > > overflow.
> > >
> > > This fails to explain _WHY_ this is a good thing to have. What kind of
> > > analysis does this enable, and is that fully implemented in [1] (I
> > > didn't look).
> >
> > Think of it as a super LBR. (Near) all things LBR can do, PT can do
> > with much more branches for each sample.
>
> Clarify the 'near'? Should we then not expose it as a BRANCH_STACK?

- Exposing it as a branch stack would need a PT decoder in kernel space.
A PT decoder is quite complicated and needs a lot of infrastructure.
- Putting a PT decoder in kernel space is bad because the decoder is
much slower than the execution and runs much better offline than online.
Decoding online would cause a lot more data loss.
- LBR has some features which are not in PT (and the other way round),
like mispredict indication, call stack mode or individual basic block level
timing. LBR also has practically no runtime overhead.
- Also, BTW, the PT decoder in user space already supports exposing
PT as a virtual LBR. It is all done after decoding, without kernel
help.
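
To illustrate the "virtual LBR" idea from the last point, here is a minimal
sketch in Python. It is not the perf tool's actual code. It assumes the
user-space decoder has already turned the trace into a stream of events, and
the event tuples, the function name virtual_lbr and the depth parameter are
all hypothetical names made up for this example.

```python
from collections import deque

def virtual_lbr(events, depth=32):
    """Yield (sample_id, branches) for every sample event.

    events: iterable of ("branch", src, dst) or ("sample", sample_id)
            tuples, as an already-decoded trace might provide them
            (hypothetical format, assumed for this sketch).
    depth:  how many of the most recent branches to keep per sample.
    """
    ring = deque(maxlen=depth)  # oldest entries fall off automatically
    for ev in events:
        if ev[0] == "branch":
            _, src, dst = ev
            ring.append((src, dst))
        elif ev[0] == "sample":
            # Report the newest branch first (a choice made for this sketch).
            yield ev[1], list(reversed(ring))
```

Because this runs on decoded data, depth can be much larger than a hardware
LBR stack, and nothing in it needs kernel help.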

> > Also long term execution recording of PT normally doesn't work well because the
> > sustained bandwidth is too high for perf and the disk to keep up
> >
> > Currently the main solution we have for that is the snapshot mode, but it
> > requires explicit instrumentation for someone to trigger snapshots.
> >
> > Sampling PT is an alternative that works for many use cases, and does
> > not rely on instrumentation.
>
> List a few use-cases on either side of that divide ?

- Snapshot mode is good for targeted performance debugging. You're looking
for something specific and can instrument for it.
- Sample mode is good for generic data collection. You don't know yet
what you're looking for, but want to see hot paths in your application.
- Sample mode also works for automated data collection, like using
it for compiler profile feedback.
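
As a rough illustration of the generic "see hot paths" use of sample mode,
the sketch below counts how often each branch edge shows up across the
decoded samples. It is only a sketch under assumptions: the input format (one
list of (src, dst) pairs per sample) and the name hot_edges are hypothetical,
and a real tool would map addresses back to symbols first.

```python
from collections import Counter

def hot_edges(samples, top=3):
    """Return the `top` most frequent (src, dst) branch edges.

    samples: iterable of per-sample branch lists, each a list of
             (src, dst) address pairs (hypothetical format).
    """
    counts = Counter()
    for branches in samples:
        counts.update(branches)  # one count per occurrence of an edge
    return counts.most_common(top)
```

The same edge counts are the kind of input a compiler profile feedback flow
could consume, which is why no manual instrumentation is needed here.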

-Andi
