Re: [RFC 0/1] BPF tracing for arm64 using fprobe

From: Alexei Starovoitov
Date: Thu Nov 17 2022 - 11:50:33 EST


On Thu, Nov 17, 2022 at 5:34 AM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>
> On Wed, 16 Nov 2022 18:41:26 -0800
> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> > On Tue, Nov 8, 2022 at 2:07 PM Florent Revest <revest@xxxxxxxxxxxx> wrote:
> > >
> > > Hi!
> > >
> > > With this RFC, I'd like to revive the conversation between BPF, ARM and tracing
> > > folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on
> > > arm64.
> > >
> > > Current status of BPF tracing
> > > =============================
> > >
> > > On currently supported architectures (like x86), BPF tracing programs are
> > > called from a JITted BPF trampoline, itself called from the ftrace patch site
> > > thanks to the ftrace "direct call" API (or from the end of the ftrace
> > > trampoline if an ftrace ops is also tracing that function, but this is
> > > transparent to BPF).
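> > >
> > > For illustration, attaching such a direct call boils down to something
> > > like the sketch below (minimal and hypothetical: attach_direct() and
> > > my_bpf_trampoline() are made-up names, and the register_ftrace_direct()
> > > signature shown here is the v6.1-era one, which may change):
> > >
> > >   #include <linux/ftrace.h>
> > >
> > >   extern void my_bpf_trampoline(void); /* stands in for the JITted code */
> > >
> > >   static int attach_direct(unsigned long ip)
> > >   {
> > >           /* Rewrite the patch site at 'ip' to branch straight into
> > >            * my_bpf_trampoline instead of the generic ftrace trampoline. */
> > >           return register_ftrace_direct(ip, (unsigned long)my_bpf_trampoline);
> > >   }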
> > >
> > > Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can be
> > > used for struct ops programs already), but Xu's attempts at getting ftrace
> > > direct calls support [2][3] on arm64 have been unsuccessful so far, so we still
> > > do not support BPF tracing programs. This prompted me to try a different
> > > approach. I'd like to collect feedback on it here.
> > >
> > > Why not direct calls?
> > > =====================
> > >
> > > Mark and Steven have not been too keen on getting direct calls on arm64 because:
> > > - working around BL instruction's limited range introduces complexity [4]
> > > - it's difficult to get reliable stacktraces with direct calls [5]
> > > - direct calls are complex to maintain on the arch/ftrace side [5]
> > >
> > > In the absence of ftrace direct call support, BPF tracing programs would need
> > > to be called from an ftrace ops instead. Note that the BPF callback signature
> > > would have to be different, so we can't reuse trampolines (direct-called
> > > callbacks receive arguments in registers whereas ftrace ops callbacks receive
> > > their arguments via a struct ftrace_regs pointer).
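> > >
> > > For illustration, such a callback would have the standard ftrace_func_t
> > > prototype below; the body is a hypothetical sketch of where the BPF
> > > dispatch would have to happen:
> > >
> > >   #include <linux/ftrace.h>
> > >
> > >   static void bpf_ops_callback(unsigned long ip, unsigned long parent_ip,
> > >                                struct ftrace_ops *op,
> > >                                struct ftrace_regs *fregs)
> > >   {
> > >           /* The traced function's arguments have to be fished out of
> > >            * *fregs here, instead of arriving directly in registers
> > >            * (x0-x7 on arm64) as they would in a direct-called BPF
> > >            * trampoline. */
> > >   }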
> > >
> > > Why fprobe?
> > > ===========
> > >
> > > Ftrace ops per se only expose an API to hook before a function. There are two
> > > systems built on top of ftrace ops that also allow hooking the function exit:
> > > fprobe (using rethook) and the function graph tracer. There are plans from
> > > Masami and Steven to unify these two systems but, as they stand, only fprobe
> > > gives enough flexibility to implement BPF tracing.
> > >
> > > In order not to reinvent the wheel, if direct calls aren't available on the
> > > arch, BPF could leverage fprobe to hook before and after the traced function.
> > > Note that return hooking is implemented a bit differently than in BPF
> > > trampolines. Instead of keeping arguments on a stack frame and calling the
> > > traced function, rethook saves arguments in a memory pool and lets the traced
> > > function run with a hijacked return pointer, so that its ret jumps back to
> > > the rethook trampoline.
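> > >
> > > Concretely, hooking both ends of a function with fprobe looks roughly
> > > like the sketch below (handler signatures match the v6.1-era fprobe API
> > > and may differ on other versions; "vfs_read" is just an example target):
> > >
> > >   #include <linux/fprobe.h>
> > >
> > >   static void entry_cb(struct fprobe *fp, unsigned long entry_ip,
> > >                        struct pt_regs *regs)
> > >   {
> > >           /* runs before the traced function, from the ftrace ops */
> > >   }
> > >
> > >   static void exit_cb(struct fprobe *fp, unsigned long entry_ip,
> > >                       struct pt_regs *regs)
> > >   {
> > >           /* runs on function return, from the rethook trampoline */
> > >   }
> > >
> > >   static struct fprobe fp = {
> > >           .entry_handler = entry_cb,
> > >           .exit_handler  = exit_cb,
> > >   };
> > >
> > >   /* somewhere in init: register_fprobe(&fp, "vfs_read", NULL); */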
> > >
> > > What about performance?
> > > =======================
> > >
> > > In its current state, an fprobe callback on arm64 is very expensive because:
> > > 1- the ftrace trampoline saves all registers (including many unnecessary ones)
> > > 2- it calls ftrace_ops_list_func which iterates over all ops and is very slow
> > > 3- the fprobe ops unconditionally hooks a rethook
> > > 4- rethook grabs memory from a freelist which is slow under high contention
> > >
> > > However, all the above points are currently being addressed:
> > > 1- by Mark's series to save argument registers only [6]
> > > 2- by Mark's series to call single ops directly [7]
> > > 3- by Masami's patch to skip rethooks if not needed [8]
> > > 4- Masami said the rethook freelist would be replaced by a per-task stack as
> > > part of its unification with the function graph tracer [9]
> > >
> > > I measured the cost of BPF tracing via different approaches on my RPi4 here: [10]
> > > tl;dr: the BPF "bench" takes a performance hit of:
> > > - 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11]
> > > - 66.8% w/ BPF on kprobe (just for reference)
> > > - 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12]
> > > - 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13]
> >
> > Even with all optimizations, the performance overhead is not acceptable.
> > It feels to me that folks are still thinking about bpf trampoline
> > as a tracing facility.
> > It's a lot more than that. It needs to run 24/7 with zero overhead.
> > It needs to replace the kernel functions and be invoked
> > millions of times a second until the system is rebooted.
> > In this environment every nanosecond counts.
> >
> > Even if the fprobe side were completely free, patch 1 has so much
> > overhead in copying bpf_cookie, regs, etc. that it's a non-starter
> > for these use cases.
> >
> > There are several other fundamental issues in this approach
> > because of fprobe/ftrace.
> > It has ftrace_test_recursion_trylock and disables preemption.
> > Both are deal breakers.
>
> I talked with Florent about this offline.
> ftrace_test_recursion_trylock() is required for generic ftrace
> use because a user callback can call a function which is itself
> traced by ftrace. That can cause an infinite loop.
> However, if the user can ensure that can't happen, I can add a
> flag to skip that trylock. (Of course, then you can shoot
> yourself in the foot.)
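>
> For reference, the guard follows this pattern (simplified from what
> fprobe_handler() does today):
>
>   int bit;
>
>   bit = ftrace_test_recursion_trylock(ip, parent_ip);
>   if (bit < 0)
>           return; /* re-entered: bail out instead of looping forever */
>   /* ... invoke the user callback ... */
>   ftrace_test_recursion_unlock(bit);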
>
> I thought the preemption disabling was for accessing per-cpu data,
> but it is actually needed for rethook to get an object from an
> RCU-protected list.
> Thus, when we move to the per-task shadow stack, it can be
> removed too.
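>
> (Illustrative pattern only, not the literal rethook code: the callback
> path today effectively does
>
>   preempt_disable_notrace();
>   node = rethook_try_get(rh); /* node comes from the RCU-managed pool */
>   /* ... */
>   preempt_enable_notrace();
>
> and once the nodes live on a per-task shadow stack, both the pool and
> the preemption disabling can go away.)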

There might not be a task available where the bpf trampoline is running.
RCU protection might not be there either.
Really, we're just scratching the surface of all the reasons why
fprobe is not usable.