Re: [PATCH 00/12] perf_events: add support for sampling taken branches (v2)

From: Stephane Eranian
Date: Sun Dec 04 2011 - 15:11:41 EST


Any update on this patchset?

On Fri, Oct 14, 2011 at 5:37 AM, Stephane Eranian <eranian@xxxxxxxxxx> wrote:
> This patchset adds an important and useful new feature to
> perf_events: branch stack sampling, that is, the
> ability to capture taken branches in each sample.
>
> Statistical sampling of taken branches should not be confused
> with branch tracing: not all branches are necessarily captured.
>
> Sampling taken branches is important for basic block profiling,
> statistical call graphs, and function call counts. Many of those
> measurements can help drive a compiler optimizer.
>
> The branch stack is a software abstraction which sits on top
> of the PMU hardware. As such, it is not available on all
> processors. For now, the patch provides the generic interface
> and the Intel X86 implementation where it leverages the Last
> Branch Record (LBR) feature (from Core2 to SandyBridge).
>
> Branch stack sampling is supported for both per-thread and
> system-wide modes.
>
> It is possible to filter the type and privilege level of branches
> to sample. The target of the branch is used to determine
> the privilege level.
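The rule that the branch target decides the privilege level can be illustrated with a small sketch. It assumes an x86-64 canonical address layout, where kernel addresses live in the upper half (bit 63 set); the helper names `branch_target_is_kernel` and `branch_matches_priv` and the `PRIV_*` bits are hypothetical, not taken from the patchset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical privilege filter bits, for illustration only. */
#define PRIV_USER   (1u << 0)
#define PRIV_KERNEL (1u << 1)

/*
 * Assumption: x86-64 canonical addresses, where the kernel occupies
 * the upper half of the address space (bit 63 set).
 */
static bool branch_target_is_kernel(uint64_t to)
{
    return (to >> 63) != 0;
}

/* A branch is kept if the privilege level of its target is requested. */
static bool branch_matches_priv(uint64_t from, uint64_t to, unsigned priv_mask)
{
    (void)from; /* the source address plays no role in the decision */
    if (branch_target_is_kernel(to))
        return (priv_mask & PRIV_KERNEL) != 0;
    return (priv_mask & PRIV_USER) != 0;
}
```

Under this rule a system call entry (user source, kernel target) counts as a kernel branch, while the return to user space counts as a user branch.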
>
> For each branch, the source and destination are captured. On
> some hardware platforms, it may be possible to also extract
> the target prediction and, in that case, it is also exposed
> to end users.
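To make the per-branch record concrete, here is a minimal sketch of what one entry could look like. The struct name `branch_entry_sketch`, its layout, the `mispred`/`predicted` flag names and the helper `count_mispredicted` are assumptions for illustration; the patchset's actual definition may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of one captured taken branch. */
struct branch_entry_sketch {
    uint64_t from;          /* source address of the taken branch */
    uint64_t to;            /* destination (target) address */
    uint64_t mispred   : 1; /* hardware reported a misprediction */
    uint64_t predicted : 1; /* hardware reported a correct prediction */
    uint64_t reserved  : 62;
};

/*
 * Count mispredicted branches in a sample. In this sketch, entries from
 * hardware that cannot report prediction have both flags clear.
 */
static size_t count_mispredicted(const struct branch_entry_sketch *e, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (e[i].mispred)
            count++;
    return count;
}
```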
>
> The branch stack can record a variable number of taken
> branches per sample. Those branches are always consecutive
> in time. The number of branches captured depends on the
> filtering and the underlying hardware. On Intel Nehalem
> and later, up to 16 consecutive branches can be captured
> per sample.
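Since the branches in a sample are consecutive in time, hardware such as the LBR can be modeled as a small ring buffer with a top-of-stack index. The sketch below assumes a 16-entry buffer and copies it out most-recent-first; `lbr_ring`, `LBR_DEPTH`, `lbr_record` and `lbr_read` are hypothetical names, and real LBR access goes through MSRs rather than an in-memory array.

```c
#include <stddef.h>
#include <stdint.h>

#define LBR_DEPTH 16 /* assumed depth, matching the Nehalem-and-later figure */

struct lbr_ring {
    uint64_t from[LBR_DEPTH];
    uint64_t to[LBR_DEPTH];
    unsigned tos;             /* index of the most recently recorded branch */
};

/* Record a taken branch: advance the top of stack, overwriting the oldest slot. */
static void lbr_record(struct lbr_ring *r, uint64_t from, uint64_t to)
{
    r->tos = (r->tos + 1) % LBR_DEPTH;
    r->from[r->tos] = from;
    r->to[r->tos] = to;
}

/*
 * Copy up to max entries, newest first, into out_from/out_to.
 * valid is the number of slots actually filled (at most LBR_DEPTH).
 */
static size_t lbr_read(const struct lbr_ring *r, size_t valid, size_t max,
                       uint64_t *out_from, uint64_t *out_to)
{
    size_t n = valid < max ? valid : max;

    if (n > LBR_DEPTH)
        n = LBR_DEPTH;
    for (size_t i = 0; i < n; i++) {
        unsigned idx = (r->tos + LBR_DEPTH - (unsigned)i) % LBR_DEPTH;
        out_from[i] = r->from[idx];
        out_to[i] = r->to[idx];
    }
    return n;
}
```

After more than 16 branches, only the 16 most recent survive, which is why the captured branches always form one consecutive run.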
>
> Branch sampling is always coupled with an event. It can
> be any PMU event, but it cannot be a SW or tracepoint event.
>
> Branch sampling is requested by setting a new sample_type
> flag called PERF_SAMPLE_BRANCH_STACK.
>
> To support branch filtering, we introduce a new field
> in the perf_event_attr struct: branch_sample_type. We chose
> NOT to overload the config1 and config2 fields because those
> are related to the event encoding. Branch stack is a
> separate feature which is combined with the event.
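The request described above can be modeled with a reduced attribute structure. In this sketch, `attr_sketch`, `validate_branch_request`, the `EV_*` event kinds and all bit values are hypothetical stand-ins, not the real `perf_event_attr` layout; it only shows that the branch filter travels in its own field, separate from the event encoding, and that software and tracepoint events are rejected.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define SAMPLE_IP           (1ULL << 0)
#define SAMPLE_BRANCH_STACK (1ULL << 1)
#define BRANCH_ANY          (1ULL << 3)

enum event_kind { EV_HARDWARE, EV_SOFTWARE, EV_TRACEPOINT };

struct attr_sketch {
    enum event_kind kind;
    uint64_t config;             /* event encoding */
    uint64_t config1, config2;   /* event encoding extensions, left untouched */
    uint64_t sample_type;        /* which fields to record in each sample */
    uint64_t branch_sample_type; /* branch filter, separate from the event */
};

/* Returns 0 if the branch stack request is acceptable, -EINVAL otherwise. */
static int validate_branch_request(const struct attr_sketch *a)
{
    if (!(a->sample_type & SAMPLE_BRANCH_STACK))
        return 0;       /* no branch stack requested, nothing to check */
    if (a->kind != EV_HARDWARE)
        return -EINVAL; /* branch sampling needs a PMU event */
    if (a->branch_sample_type == 0)
        return -EINVAL; /* assumption: an empty filter is rejected */
    return 0;
}
```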
>
> The branch_sample_type is a bitmask of possible filters.
> The following filters are defined (more can be added):
> - PERF_SAMPLE_BRANCH_ANY     : any control flow change
> - PERF_SAMPLE_BRANCH_USER    : capture branches when target is at user level
> - PERF_SAMPLE_BRANCH_KERNEL  : capture branches when target is at kernel level
> - PERF_SAMPLE_BRANCH_ANY_CALL: capture call branches (incl. syscalls)
> - PERF_SAMPLE_BRANCH_ANY_RET : capture return branches (incl. syscall returns)
> - PERF_SAMPLE_BRANCH_IND_CALL: capture indirect calls
>
> It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.
>
> When the privilege level is not specified, the branch stack
> inherits that of the associated event.
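Combining filters and inheriting the privilege level can be sketched as follows. The `BR_*` bit values, the `exclude_user`/`exclude_kernel` parameters and `effective_branch_mask` are assumptions made for this illustration, loosely modeled on the way an event carries its own privilege exclusions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical filter bits mirroring the list above. */
#define BR_USER      (1u << 0)
#define BR_KERNEL    (1u << 1)
#define BR_ANY       (1u << 2)
#define BR_ANY_CALL  (1u << 3)
#define BR_ANY_RET   (1u << 4)
#define BR_IND_CALL  (1u << 5)
#define BR_PRIV_MASK (BR_USER | BR_KERNEL)

/*
 * If the caller gave no privilege bits, inherit them from the event's
 * own exclusions; otherwise keep the requested ones unchanged.
 */
static uint32_t effective_branch_mask(uint32_t requested,
                                      bool exclude_user, bool exclude_kernel)
{
    if (requested & BR_PRIV_MASK)
        return requested;
    if (!exclude_user)
        requested |= BR_USER;
    if (!exclude_kernel)
        requested |= BR_KERNEL;
    return requested;
}
```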
>
> Some processors may not offer hardware branch filtering, e.g., Intel
> Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
> X86 implementation in this patchset also provides a SW branch filter
> which works on a best effort basis. It can compensate for the lack
> of LBR filtering. But first and foremost, it helps work around LBR
> filtering errata. The goal is to only capture the type of branches
> requested by the user.
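A best-effort software filter of this kind boils down to classifying each captured branch and dropping the ones the user did not ask for. In this sketch the branch type is assumed to be already known; the patchset presumably obtains it with its instruction decoder, which is not modeled here. `br`, `br_type`, `type_wanted`, `sw_filter` and the `BR_*` bits are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define BR_ANY      (1u << 0)
#define BR_ANY_CALL (1u << 1)
#define BR_ANY_RET  (1u << 2)
#define BR_IND_CALL (1u << 3)

enum br_type { T_JCC, T_JMP, T_CALL, T_IND_CALL, T_SYSCALL, T_RET, T_SYSRET };

struct br {
    uint64_t from, to;
    enum br_type type; /* assumed to come from an instruction decoder */
};

/* Return nonzero if a branch of type t is covered by mask. */
static int type_wanted(enum br_type t, uint32_t mask)
{
    if (mask & BR_ANY)
        return 1;
    switch (t) {
    case T_CALL:
    case T_SYSCALL:
        return (mask & BR_ANY_CALL) != 0;
    case T_IND_CALL:
        return (mask & (BR_ANY_CALL | BR_IND_CALL)) != 0;
    case T_RET:
    case T_SYSRET:
        return (mask & BR_ANY_RET) != 0;
    default:
        return 0;
    }
}

/* Compact the array in place, keeping only wanted branches; returns the new count. */
static size_t sw_filter(struct br *b, size_t n, uint32_t mask)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
        if (type_wanted(b[i].type, mask))
            b[kept++] = b[i];
    return kept;
}
```

Because the filter only removes entries, it can clean up the output of a buggy or missing hardware filter, but it cannot recover branches that the hardware never recorded.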
>
> It is possible to combine branch stack sampling with PEBS on Intel
> X86 processors. Depending on the precise_sampling mode, there are
> certain filtering restrictions. When precise_sampling=1, there
> are no filtering restrictions. When precise_sampling > 1,
> only the ANY|USER|KERNEL filters can be used. This comes from
> the fact that the kernel uses LBR to compensate for the PEBS
> off-by-1 skid on the instruction pointer.
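The PEBS restriction amounts to a simple check at event creation time. In this sketch, `precise` stands for the precise_sampling level, and `check_pebs_branch_filter` and the `BR_*` bits are hypothetical names; only the rule itself (levels above 1 allow nothing beyond ANY|USER|KERNEL) comes from the description.

```c
#include <errno.h>
#include <stdint.h>

#define BR_USER     (1u << 0)
#define BR_KERNEL   (1u << 1)
#define BR_ANY      (1u << 2)
#define BR_ANY_CALL (1u << 3)
#define BR_IND_CALL (1u << 4)

/*
 * With precise > 1 the kernel itself relies on LBR to correct the PEBS
 * off-by-1 instruction pointer, so branch types must not be filtered:
 * only ANY, optionally restricted by privilege level, is accepted.
 */
static int check_pebs_branch_filter(unsigned precise, uint32_t mask)
{
    const uint32_t allowed = BR_ANY | BR_USER | BR_KERNEL;

    if (precise > 1 && (mask & ~allowed))
        return -EINVAL;
    return 0;
}
```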
>
> To demonstrate how the perf_event branch stack sampling interface
> works, the patchset also modifies perf record to capture taken
> branches. Similarly, perf report is enhanced to display a histogram
> of taken branches.
>
> I would like to thank Roberto Vitillo @ LBL for his work on the perf
> tool for this.
>
> Enough talking, let's take a simple example. Our trivial test program
> goes like this:
>
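> /* N is not defined in the original; this default is an assumption,
>  * override it with -DN=<iterations> when compiling. */
> #ifndef N
> #define N 100000000UL
> #endif
>
> void f2(void)
> {}
> void f3(void)
> {}
> void f1(unsigned long n)
> {
>   if (n & 1UL)
>     f2();
>   else
>     f3();
> }
> int main(void)
> {
>   unsigned long i;
>
>   for (i = 0; i < N; i++)
>     f1(i);
>   return 0;
> }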
>
> $ perf record -b any branchy
> $ perf report -b
> # Events: 23K cycles
> #
> # Overhead  Source Symbol     Target Symbol
> # ........  ................  ................
>
>     18.13%  [.] f1            [.] main
>     18.10%  [.] main          [.] main
>     18.01%  [.] main          [.] f1
>     15.69%  [.] f1            [.] f1
>      9.11%  [.] f3            [.] f1
>      6.78%  [.] f1            [.] f3
>      6.74%  [.] f1            [.] f2
>      6.71%  [.] f2            [.] f1
>
> Of the total number of branches captured, 18.13% were from f1() -> main().
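The Overhead column can be read as the share of one (source, target) symbol pair among all captured branches. A minimal sketch of that computation, assuming the pairs have already been resolved to symbol names; `pair_count` and `pair_overhead` are hypothetical, and perf report's actual histogram code is not reproduced here.

```c
#include <stddef.h>
#include <string.h>

struct pair_count {
    const char *source;
    const char *target;
    unsigned long count; /* number of captured branches for this pair */
};

/* Percentage of all captured branches that go from src to dst. */
static double pair_overhead(const struct pair_count *p, size_t n,
                            const char *src, const char *dst)
{
    unsigned long total = 0, match = 0;

    for (size_t i = 0; i < n; i++) {
        total += p[i].count;
        if (strcmp(p[i].source, src) == 0 && strcmp(p[i].target, dst) == 0)
            match += p[i].count;
    }
    return total ? 100.0 * (double)match / (double)total : 0.0;
}
```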
>
> Let's make this clearer by filtering the user call branches only:
>
> $ perf record -b any_call -e cycles:u branchy
> $ perf report
> # Events: 19K cycles
> #
> # Overhead  Source Symbol              Target Symbol
> # ........  .........................  .........................
> #
>     52.50%  [.] main                   [.] f1
>     23.99%  [.] f1                     [.] f3
>     23.48%  [.] f1                     [.] f2
>      0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
>      0.01%  [k] _start                 [k] __libc_start_main
>
> Now it is more obvious: 52% of all the captured branches were calls from main() -> f1().
> The rest is split 50/50 between f1() -> f2() and f1() -> f3(), which is expected given
> that f1() dispatches based on odd vs. even values of n, which is constantly increasing.
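The expected proportions follow from the structure of the test program: every iteration makes exactly one main() -> f1() call and exactly one of f1() -> f2() or f1() -> f3(), so calls from main() account for half of all user call branches, and the odd/even test splits the other half evenly. The sketch below simply counts those call edges for a given iteration count; `call_edges` and `count_call_edges` are hypothetical, and the handful of libc and startup calls seen in the report are ignored.

```c
#include <stdint.h>

struct call_edges {
    uint64_t main_f1; /* main() -> f1() */
    uint64_t f1_f2;   /* f1() -> f2(), odd n */
    uint64_t f1_f3;   /* f1() -> f3(), even n */
};

static struct call_edges count_call_edges(uint64_t n_iter)
{
    struct call_edges e = {0, 0, 0};

    for (uint64_t i = 0; i < n_iter; i++) {
        e.main_f1++;
        if (i & 1)
            e.f1_f2++;
        else
            e.f1_f3++;
    }
    return e;
}
```

For an even iteration count this yields a 50% / 25% / 25% split, close to the 52.50%, 23.99% and 23.48% reported; the remaining gap is presumably due to statistical sampling.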
>
>
> In version 2, we updated the patch to tip/master (commit 5734857) and
> incorporated the feedback from v1 concerning the anonymous bitfield
> struct for branch_stack_entry and the handling of i386 ABI binaries
> on 64-bit hosts in the instruction decoder for the LBR SW filter.
>
> Signed-off-by: Stephane Eranian <eranian@xxxxxxxxxx>
>
>
> Roberto Agostino Vitillo (2):
>   perf: add support for sampling taken branch to perf record
>   perf: add support for taken branch sampling to perf report
>
> Stephane Eranian (10):
>   perf_events: add generic taken branch sampling support
>   perf_events: add Intel LBR MSR definitions
>   perf_events: add Intel X86 LBR sharing logic
>   perf_events: sync branch stack sampling with X86 precise_sampling
>   perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters
>   perf_events: implement PERF_SAMPLE_BRANCH for Intel X86
>   perf_events: add LBR software filter support for Intel X86
>   perf_events: disable PERF_SAMPLE_BRANCH_* when not supported
>   perf_events: add hook to flush branch_stack on context switch
>   perf: add code to support PERF_SAMPLE_BRANCH_STACK
>
>  arch/alpha/kernel/perf_event.c             |   4 +
>  arch/arm/kernel/perf_event.c               |   4 +
>  arch/mips/kernel/perf_event.c              |   4 +
>  arch/powerpc/kernel/perf_event.c           |   4 +
>  arch/sh/kernel/perf_event.c                |   4 +
>  arch/sparc/kernel/perf_event.c             |   4 +
>  arch/x86/include/asm/msr-index.h           |   7 +
>  arch/x86/kernel/cpu/perf_event.c           |  62 +++-
>  arch/x86/kernel/cpu/perf_event_amd.c       |   3 +
>  arch/x86/kernel/cpu/perf_event_intel.c     | 126 +++++--
>  arch/x86/kernel/cpu/perf_event_intel_ds.c  |  21 +-
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 529 ++++++++++++++++++++++++++--
>  include/linux/perf_event.h                 |  74 ++++-
>  kernel/events/core.c                       | 167 +++++++++
>  kernel/events/hw_breakpoint.c              |   6 +
>  tools/perf/Documentation/perf-record.txt   |  18 +
>  tools/perf/Documentation/perf-report.txt   |   7 +
>  tools/perf/builtin-record.c                |  75 ++++
>  tools/perf/builtin-report.c                |  93 +++++-
>  tools/perf/perf.h                          |  17 +
>  tools/perf/util/annotate.c                 |   2 +-
>  tools/perf/util/event.h                    |   1 +
>  tools/perf/util/evsel.c                    |  10 +
>  tools/perf/util/hist.c                     |  97 ++++--
>  tools/perf/util/hist.h                     |   6 +
>  tools/perf/util/session.c                  |  72 ++++
>  tools/perf/util/session.h                  |   5 +
>  tools/perf/util/sort.c                     | 348 ++++++++++++++-----
>  tools/perf/util/sort.h                     |   5 +
>  tools/perf/util/symbol.h                   |  13 +
>  30 files changed, 1584 insertions(+), 204 deletions(-)
>
>