Re: [PATCH 1/1] perf trace: Use the augmented_raw_syscall BPF skel only for tracing syscalls

From: Ian Rogers
Date: Thu Aug 17 2023 - 11:45:19 EST


On Thu, Aug 17, 2023 at 8:37 AM Arnaldo Carvalho de Melo
<acme@xxxxxxxxxx> wrote:
>
> It is possible to use 'perf trace' with tracepoints and in that case we
> can't initialize/use the augmented_raw_syscalls BPF skel.
>
> For instance, this usecase:
>
> # perf trace -e sched:*exec --max-events=5
> ? ( ): NetworkManager/1183 ... [continued]: poll()) = 1
> 0.043 ( 0.007 ms): NetworkManager/1183 epoll_wait(epfd: 17<anon_inode:[eventpoll]>, events: 0x55555f90e920, maxevents: 6) = 0
> 0.060 ( 0.007 ms): NetworkManager/1183 write(fd: 3<anon_inode:[eventfd]>, buf: 0x7ffc5a27cd30, count: 8) = 8
> 0.073 ( 0.005 ms): NetworkManager/1183 epoll_wait(epfd: 24<anon_inode:[eventpoll]>, events: 0x7ffc5a27cd20, maxevents: 2) = 1
> 0.082 ( 0.010 ms): NetworkManager/1183 recvmmsg(fd: 26<socket:[30298]>, mmsg: 0x7ffc5a27caa0, vlen: 8) = 1
> #
>
> Where we want to trace just some sched tracepoints ending in 'exec' ends
> up tracing all syscalls.
>
> Fix it by checking existing trace->trace_syscalls boolean to see if we
> need the augmenter.
>
> A followup patch will move those sections of code used only with the
> augmenter to separate functions, to get it cleaner and remove the goto,
> done just for reviewing purposes.
>
> With this patch in place the previous behaviour is restored: no syscalls
> when we have other events and no syscall names:
>
> [root@quaco ~]# perf probe do_filp_open "filename=pathname->name:string"
> Added new event:
> probe:do_filp_open (on do_filp_open with filename=pathname->name:string)
>
> You can now use it in all perf tools, such as:
>
> perf record -e probe:do_filp_open -aR sleep 1
>
> [root@quaco ~]# perf trace --max-events=10 -e probe:do_filp_open sleep 1
> 0.000 sleep/455122 probe:do_filp_open(__probe_ip: -1186560412, filename: "/etc/ld.so.cache")
> 0.056 sleep/455122 probe:do_filp_open(__probe_ip: -1186560412, filename: "/lib64/libc.so.6")
> 0.481 sleep/455122 probe:do_filp_open(__probe_ip: -1186560412, filename: "/usr/lib/locale/locale-archive")
> 0.501 sleep/455122 probe:do_filp_open(__probe_ip: -1186560412, filename: "/usr/share/locale/locale.alias")
> 0.572 sleep/455122 probe:do_filp_open(__probe_ip: -1186560412, filename: "/usr/lib/locale/en_US.UTF-8/LC_IDENTIFICATION")
> 0.581 sleep/455122 probe:do_filp_open(__probe_ip: -1186560412, filename: "/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION")
> 0.616 sleep/455122 probe:do_filp_open(__probe_ip: -1186560412, filename: "/usr/lib64/gconv/gconv-modules.cache")
> 0.656 sleep/455122 probe:do_filp_open(__probe_ip: -1186560412, filename: "/usr/lib/locale/en_US.UTF-8/LC_MEASUREMENT")
> 0.664 sleep/455122 probe:do_filp_open(__probe_ip: -1186560412, filename: "/usr/lib/locale/en_US.utf8/LC_MEASUREMENT")
> 0.696 sleep/455122 probe:do_filp_open(__probe_ip: -1186560412, filename: "/usr/lib/locale/en_US.UTF-8/LC_TELEPHONE")
> [root@quaco ~]#
>
> As well as mixing syscalls with tracepoints, getting the syscall
> tracepoints used augmented using the BPF skel:
>
> [root@quaco ~]# perf trace --max-events=10 -e open*,probe:do_filp_open sleep 1
> 0.000 ( ): sleep/455124 openat(dfd: CWD, filename: "/etc/ld.so.cache", flags: RDONLY|CLOEXEC) ...
> 0.005 ( ): sleep/455124 probe:do_filp_open(__probe_ip: -1186560412, filename: "/etc/ld.so.cache")
> 0.000 ( 0.011 ms): sleep/455124 ... [continued]: openat()) = 3
> 0.031 ( ): sleep/455124 openat(dfd: CWD, filename: "/lib64/libc.so.6", flags: RDONLY|CLOEXEC) ...
> 0.033 ( ): sleep/455124 probe:do_filp_open(__probe_ip: -1186560412, filename: "/lib64/libc.so.6")
> 0.031 ( 0.006 ms): sleep/455124 ... [continued]: openat()) = 3
> 0.258 ( ): sleep/455124 openat(dfd: CWD, filename: "/usr/lib/locale/locale-archive", flags: RDONLY|CLOEXEC) ...
> 0.261 ( ): sleep/455124 probe:do_filp_open(__probe_ip: -1186560412, filename: "/usr/lib/locale/locale-archive")
> 0.258 ( 0.006 ms): sleep/455124 ... [continued]: openat()) = -1 ENOENT (No such file or directory)
> 0.272 ( ): sleep/455124 openat(dfd: CWD, filename: "/usr/share/locale/locale.alias", flags: RDONLY|CLOEXEC) ...
> 0.273 ( ): sleep/455124 probe:do_filp_open(__probe_ip: -1186560412, filename: "/usr/share/locale/locale.alias")
>
> A final note: the probe:do_filp_open uses a kprobe (probably optimized
> as its in the start of a function) that uses the kprobe_tracer mechanism
> in the kernel to collect the pathname->name string and stash it into the
> tracepoint created by 'perf probe' for that:
>
> [root@quaco ~]# cat /sys/kernel/debug/tracing/kprobe_events
> p:probe/do_filp_open _text+4621920 filename=+0(+0(%si)):string
> [root@quaco ~]#
>
> While the syscalls:sys_enter_openat tracepoint gets its string from a
> BPF program attached to raw_syscalls:sys_enter that tail calls into
> another BPF program that knows the types for the openat syscall args and
> thus can bpf_probe_read it right after the normal
> sys_enter/sys_enter_openat tracepoint payload that comes prefixed with
> whatever perf_event_open asked for (CPU, timestamp, etc):
>
> [root@quaco ~]# bpftool prog | grep -E "sys_enter |sys_enter_opena" -A3
> 3176: tracepoint name sys_enter tag 0bc3fc9d11754ba1 gpl
> loaded_at 2023-08-17T12:32:20-0300 uid 0
> xlated 272B jited 257B memlock 4096B map_ids 2462,2466,2463
> btf_id 2976
> --
> 3180: tracepoint name sys_enter_opena tag 19dd077f00ec2f58 gpl
> loaded_at 2023-08-17T12:32:20-0300 uid 0
> xlated 328B jited 206B memlock 4096B map_ids 2466,2465
> btf_id 2976
> [root@quaco ~]#
>
> Fixes: 42963c8bedeb864b ("perf trace: Migrate BPF augmentation to use a skeleton")
> Cc: Adrian Hunter <adrian.hunter@xxxxxxxxx>
> Cc: Alexander Shishkin <alexander.shishkin@xxxxxxxxxxxxxxx>
> Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> Cc: Andrii Nakryiko <andrii@xxxxxxxxxx>
> Cc: Anshuman Khandual <anshuman.khandual@xxxxxxx>
> Cc: Athira Jajeev <atrajeev@xxxxxxxxxxxxxxxxxx>
> Cc: bpf@xxxxxxxxxxxxxxx
> Cc: Brendan Gregg <brendan.d.gregg@xxxxxxxxx>
> Cc: Carsten Haitzler <carsten.haitzler@xxxxxxx>
> Cc: Eduard Zingerman <eddyz87@xxxxxxxxx>
> Cc: Fangrui Song <maskray@xxxxxxxxxx>
> Cc: He Kuang <hekuang@xxxxxxxxxx>
> Cc: Ian Rogers <irogers@xxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: James Clark <james.clark@xxxxxxx>
> Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
> Cc: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
> Cc: Leo Yan <leo.yan@xxxxxxxxxx>
> Cc: llvm@xxxxxxxxxxxxxxx
> Cc: Madhavan Srinivasan <maddy@xxxxxxxxxxxxx>
> Cc: Mark Rutland <mark.rutland@xxxxxxx>
> Cc: Namhyung Kim <namhyung@xxxxxxxxxx>
> Cc: Nathan Chancellor <nathan@xxxxxxxxxx>
> Cc: Naveen N. Rao <naveen.n.rao@xxxxxxxxxxxxxxxxxx>
> Cc: Nick Desaulniers <ndesaulniers@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Ravi Bangoria <ravi.bangoria@xxxxxxx>
> Cc: Rob Herring <robh@xxxxxxxxxx>
> Cc: Tiezhu Yang <yangtiezhu@xxxxxxxxxxx>
> Cc: Tom Rix <trix@xxxxxxxxxx>
> Cc: Wang Nan <wangnan0@xxxxxxxxxx>
> Cc: Wang ShaoBo <bobo.shaobowang@xxxxxxxxxx>
> Cc: Yang Jihong <yangjihong1@xxxxxxxxxx>
> Cc: Yonghong Song <yhs@xxxxxx>
> Cc: YueHaibing <yuehaibing@xxxxxxxxxx>
> Link: https://lore.kernel.org/lkml/
> Signed-off-by: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>

Reviewed-by: Ian Rogers <irogers@xxxxxxxxxx>

Thanks,
Ian

> ---
> tools/perf/builtin-trace.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> index 0ebfa95895e0bf4d..3964cf44cdbcb3e8 100644
> --- a/tools/perf/builtin-trace.c
> +++ b/tools/perf/builtin-trace.c
> @@ -3895,7 +3895,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
> if (err < 0)
> goto out_error_open;
> #ifdef HAVE_BPF_SKEL
> - {
> + if (trace->syscalls.events.bpf_output) {
> struct perf_cpu cpu;
>
> /*
> @@ -3916,7 +3916,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
> goto out_error_mem;
>
> #ifdef HAVE_BPF_SKEL
> - if (trace->skel->progs.sys_enter)
> + if (trace->skel && trace->skel->progs.sys_enter)
> trace__init_syscalls_bpf_prog_array_maps(trace);
> #endif
>
> @@ -4850,6 +4850,9 @@ int cmd_trace(int argc, const char **argv)
> }
>
> #ifdef HAVE_BPF_SKEL
> + if (!trace.trace_syscalls)
> + goto skip_augmentation;
> +
> trace.skel = augmented_raw_syscalls_bpf__open();
> if (!trace.skel) {
> pr_debug("Failed to open augmented syscalls BPF skeleton");
> @@ -4884,6 +4887,7 @@ int cmd_trace(int argc, const char **argv)
> }
> trace.syscalls.events.bpf_output = evlist__last(trace.evlist);
> assert(!strcmp(evsel__name(trace.syscalls.events.bpf_output), "__augmented_syscalls__"));
> +skip_augmentation:
> #endif
> err = -1;
>
> --
> 2.41.0
>