Re: LBR callchains from tracepoints

From: Arnaldo Carvalho de Melo
Date: Tue Apr 26 2016 - 14:20:43 EST


Em Tue, Apr 26, 2016 at 10:26:32AM -0700, Alexei Starovoitov escreveu:
> On Tue, Apr 26, 2016 at 06:38:28PM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 25, 2016 at 10:24:31PM -0300, Arnaldo Carvalho de Melo wrote:
> > > Em Mon, Apr 25, 2016 at 10:03:58PM -0300, Arnaldo Carvalho de Melo escreveu:
> > > > I now need to continue investigation why this doesn't seem to work from
> > > > tracepoints...

> > > Bummer, the changeset (at the end of this message) hasn't any
> > > explanation, is this really impossible? I.e. LBR callstacks from
> > > tracepoints? Even if we set perf_event_attr.exclude_callchain_kernel?

> > Could maybe be done, but its tricky to implement as the LBR is managed
> > by the hardware PMU and tracepoints are a software PMU, so we need to
> > then somehow frob with cross-pmu resources, in a very arch specific way.
> > And programmability of the hardware PMU will then depend on events
> > outside of it.

> btw we're thinking to add support for lbr to bpf, so that from the program
> we can get accurate and fast stacks. That's especially important for user
> space stacks. No clear idea how to do it yet, but it would be really useful.

Yeah, and that already works in perf, its just that it doesn't work from some
points (PERF_TYPE_SOFTWARE, PERF_TYPE_TRACEPOINT, etc), as described in the
changeset I mentioned.

'perf trace --call-graph lbr' doesn't work right now even with it
interested only in the user space bits, i.e. setting
perf_event_attr.exclude_callchain_kernel.

# perf trace --call-graph dwarf

works, but that, as you mention, really isn't "fast" and sometimes not
accurate, or at least wasn't with broken toolchains.

Example of mixed strace-like with userspace-only DWARF callchains (would be
lovely if this was with LBR, huh?) plus fp callchains for the
sched:sched_switch tracepoint plus LBR callchains for a hardware event, cycles,
look further below for the reason of the broken timestamps for
PERF_TYPE_HARDWARE events:

# perf trace -e nanosleep --event sched:sched_switch/call-graph=fp/ --ev cycles/call-graph=lbr,period=100/ usleep 1
18446744073709.551 ( ): cycles/call-graph=lbr,period=100/:)
__intel_pmu_enable_all+0xfe200080 ([kernel.kallsyms])
intel_pmu_enable_all+0xfe200010 ([kernel.kallsyms])
x86_pmu_enable+0xfe200271 ([kernel.kallsyms])
perf_pmu_enable.part.81+0xfe200007 ([kernel.kallsyms])
ctx_resched+0xfe20007a ([kernel.kallsyms])
perf_event_exec+0xfe20011d ([kernel.kallsyms])
setup_new_exec+0xfe20006f ([kernel.kallsyms])
load_elf_binary+0xfe2003e3 ([kernel.kallsyms])
search_binary_handler+0xfe20009e ([kernel.kallsyms])
do_execveat_common.isra.38+0xfe20052c ([kernel.kallsyms])
sys_execve+0xfe20003a ([kernel.kallsyms])
do_syscall_64+0xfe200062 ([kernel.kallsyms])
return_from_SYSCALL_64+0xfe200000 ([kernel.kallsyms])
[0] ([unknown])
0.310 ( 0.006 ms): usleep/20951 nanosleep(rqtp: 0x7ffda8904500 ) ...
0.310 ( ): sched:sched_switch:usleep:20951 [120] S ==> swapper/3:0 [120])
__schedule+0xfe200402 ([kernel.kallsyms])
schedule+0xfe200035 ([kernel.kallsyms])
do_nanosleep+0xfe20006f ([kernel.kallsyms])
hrtimer_nanosleep+0xfe2000dc ([kernel.kallsyms])
sys_nanosleep+0xfe20007a ([kernel.kallsyms])
do_syscall_64+0xfe200062 ([kernel.kallsyms])
return_from_SYSCALL_64+0xfe200000 ([kernel.kallsyms])
__nanosleep+0xffff00bfad62c010 (/usr/lib64/libc-2.22.so)
18446679523046.461 ( ): cycles/call-graph=lbr,period=100/:)
perf_pmu_enable.part.81+0xfe200007 ([kernel.kallsyms])
__perf_event_task_sched_in+0xfe2001ad ([kernel.kallsyms])
finish_task_switch+0xfe200156 ([kernel.kallsyms])
__schedule+0xfe200397 ([kernel.kallsyms])
schedule+0xfe200035 ([kernel.kallsyms])
do_nanosleep+0xfe20006f ([kernel.kallsyms])
hrtimer_nanosleep+0xfe2000dc ([kernel.kallsyms])
sys_nanosleep+0xfe20007a ([kernel.kallsyms])
do_syscall_64+0xfe200062 ([kernel.kallsyms])
return_from_SYSCALL_64+0xfe200000 ([kernel.kallsyms])
[0] ([unknown])
0.377 ( 0.073 ms): usleep/20951 ... [continued]: nanosleep()) = 0
[root@jouet ~]#


perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
config 0 (PERF_COUNT_HW_CPU_CYCLES)
size 112
{ sample_period, sample_freq } 100
sample_type IP|TID|CALLCHAIN|BRANCH_STACK|IDENTIFIER
read_format ID
disabled 1
inherit 1
enable_on_exec 1
sample_id_all 1
exclude_guest 1
{ wakeup_events, wakeup_watermark } 1
branch_sample_type USER|CALL_STACK|NO_FLAGS|NO_CYCLES

missing PERF_SAMPLE_TIME, will fix.

- Arnaldo