Re: BPF skels in perf .Re: [GIT PULL] perf tools changes for v6.4

From: Namhyung Kim
Date: Thu May 04 2023 - 18:46:50 EST

Next message: Jason Gunthorpe: "Re: [PATCH 0/2] iommu: Make flush queues a proper capability"
Previous message: Dexuan Cui: "[PATCH] Drivers: hv: vmbus: Call hv_synic_free() if hv_synic_alloc() fails"
In reply to: Arnaldo Carvalho de Melo: "Re: [PATCH RFC/RFT] perf bpf skels: Stop using vmlinux.h generated from BTF, use subset of used structs + CO-RE. was Re: BPF skels in perf .Re: [GIT PULL] perf tools changes for v6.4"
Next in thread: pr-tracker-bot: "Re: [GIT PULL] perf tools changes for v6.4"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, May 04, 2023 at 06:48:50PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, May 04, 2023 at 04:07:29PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Thu, May 04, 2023 at 11:50:07AM -0700, Andrii Nakryiko escreveu:
> > > On Thu, May 4, 2023 at 10:52 AM Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> wrote:
> > > > Andrii, can you add some more information about the usage of vmlinux.h
> > > > instead of using kernel headers?
> >
> > > I'll just say that vmlinux.h is not a hard requirement to build BPF
> > > programs, it's more a convenience allowing easy access to definitions
> > > of both UAPI and kernel-internal structures for tracing needs and
> > > marking them relocatable using BPF CO-RE machinery. Lots of real-world
> > > applications just check-in pregenerated vmlinux.h to avoid build-time
> > > dependency on up-to-date host kernel and such.
> >
> > > If vmlinux.h generation and usage is causing issues, though, given
> > > that perf's BPF programs don't seem to be using many different kernel
> > > types, it might be a better option to just use UAPI headers for public
> > > kernel type definitions, and just define CO-RE-relocatable minimal
> > > definitions locally in perf's BPF code for the other types necessary.
> > > E.g., if perf needs only pid and tgid from task_struct, this would
> > > suffice:
> >
> > > struct task_struct {
> > > int pid;
> > > int tgid;
> > > } __attribute__((preserve_access_index));
> >
> > Yeah, that seems like a much better approach: no vmlinux.h involved. libbpf
> > CO-RE notices that task_struct differs from this two-integer version
> > (of course) and relocates the accesses to where the fields are in the
> > running kernel by using /sys/kernel/btf/vmlinux.
>
> Doing it for one of the skels: build tested, runtime untested, but not
> using any vmlinux, BTF to help. Not that bad: more verbose, but at least
> we state which fields we actually use, have those attributes
> documenting that those offsets will be recorded for future use, etc.
>
> Namhyung, can you please check that this works?

Yep, it works great!

$ sudo ./perf stat -a --bpf-counters --for-each-cgroup /,user.slice,system.slice sleep 1

 Performance counter stats for 'system wide':

         64,110.41 msec cpu-clock                /             # 64.004 CPUs utilized
            15,787      context-switches         /             # 246.247 /sec
                72      cpu-migrations           /             # 1.123 /sec
             1,236      page-faults              /             # 19.279 /sec
       848,608,137      cycles                   /             # 0.013 GHz (83.23%)
       106,928,070      stalled-cycles-frontend  /             # 12.60% frontend cycles idle (83.23%)
       209,204,795      stalled-cycles-backend   /             # 24.65% backend cycles idle (83.23%)
       645,183,025      instructions             /             # 0.76 insn per cycle
                                                               # 0.32 stalled cycles per insn (83.24%)
       141,776,876      branches                 /             # 2.211 M/sec (83.63%)
         3,001,078      branch-misses            /             # 2.12% of all branches (83.44%)
             66.67 msec cpu-clock                user.slice    # 0.067 CPUs utilized
               695      context-switches         user.slice    # 10.424 K/sec
                22      cpu-migrations           user.slice    # 329.966 /sec
             1,202      page-faults              user.slice    # 18.028 K/sec
       150,514,330      cycles                   user.slice    # 2.257 GHz (90.17%)
        13,504,605      stalled-cycles-frontend  user.slice    # 8.97% frontend cycles idle (69.71%)
        38,859,376      stalled-cycles-backend   user.slice    # 25.82% backend cycles idle (95.28%)
       189,382,145      instructions             user.slice    # 1.26 insn per cycle
                                                               # 0.21 stalled cycles per insn (88.92%)
        36,019,878      branches                 user.slice    # 540.242 M/sec (90.16%)
           697,723      branch-misses            user.slice    # 1.94% of all branches (65.77%)
             44.33 msec cpu-clock                system.slice  # 0.044 CPUs utilized
             2,382      context-switches         system.slice  # 53.732 K/sec
                42      cpu-migrations           system.slice  # 947.418 /sec
                34      page-faults              system.slice  # 766.958 /sec
       100,383,549      cycles                   system.slice  # 2.264 GHz (87.27%)
        10,165,225      stalled-cycles-frontend  system.slice  # 10.13% frontend cycles idle (71.73%)
        29,964,682      stalled-cycles-backend   system.slice  # 29.85% backend cycles idle (84.94%)
       101,210,743      instructions             system.slice  # 1.01 insn per cycle
                                                               # 0.30 stalled cycles per insn (80.68%)
        19,893,831      branches                 system.slice  # 448.757 M/sec (86.94%)
           397,854      branch-misses            system.slice  # 2.00% of all branches (88.42%)

       1.001667221 seconds time elapsed

Thanks,
Namhyung

> diff --git a/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
> index 6a438e0102c5a2cb..f376d162549ebd74 100644
> --- a/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
> +++ b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
> @@ -1,11 +1,40 @@
> // SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> // Copyright (c) 2021 Facebook
> // Copyright (c) 2021 Google
> -#include "vmlinux.h"
> +#include <linux/types.h>
> +#include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
> #include <bpf/bpf_tracing.h>
> #include <bpf/bpf_core_read.h>
>
> +// libbpf's CO-RE will take care of the relocations so that these fields match
> +// the layout of these structs in the kernel where this ends up running on.
> +
> +struct cgroup_subsys_state {
> + struct cgroup *cgroup;
> +} __attribute__((preserve_access_index));
> +
> +struct css_set {
> + struct cgroup_subsys_state *subsys[13];
> +} __attribute__((preserve_access_index));
> +
> +struct task_struct {
> + struct css_set *cgroups;
> +} __attribute__((preserve_access_index));
> +
> +struct kernfs_node {
> + __u64 id;
> +} __attribute__((preserve_access_index));
> +
> +struct cgroup {
> + struct kernfs_node *kn;
> + int level;
> +} __attribute__((preserve_access_index));
> +
> +enum cgroup_subsys_id {
> + perf_event_cgrp_id = 8,
> +};
> +
> #define MAX_LEVELS 10 // max cgroup hierarchy level: arbitrary
> #define MAX_EVENTS 32 // max events per cgroup: arbitrary
>
> @@ -52,7 +81,7 @@ struct cgroup___new {
> /* old kernel cgroup definition */
> struct cgroup___old {
> int level;
> - u64 ancestor_ids[];
> + __u64 ancestor_ids[];
> } __attribute__((preserve_access_index));
>
> const volatile __u32 num_events = 1;
>
