Re: [PATCH bpf-next v5 1/6] arm64: ftrace: Add ftrace direct call support

From: Xu Kuohai
Date: Wed Aug 10 2022 - 04:10:41 EST


On 8/10/2022 1:03 AM, Florent Revest wrote:
On Thu, Jun 9, 2022 at 6:27 AM Xu Kuohai <xukuohai@xxxxxxxxxx> wrote:
On 6/7/2022 12:35 AM, Mark Rutland wrote:
On Thu, May 26, 2022 at 10:48:05PM +0800, Xu Kuohai wrote:
On 5/26/2022 6:06 PM, Mark Rutland wrote:
On Thu, May 26, 2022 at 05:45:03PM +0800, Xu Kuohai wrote:
On 5/25/2022 9:38 PM, Mark Rutland wrote:
On Wed, May 18, 2022 at 09:16:33AM -0400, Xu Kuohai wrote:
As noted in that thread, I have a few concerns which equally apply here:

* Due to the limited range of BL instructions, it's not always possible to
patch an ftrace call-site to branch to an arbitrary trampoline. The way this
works for ftrace today relies upon knowingthe set of trampolines at
compile-time, and allocating module PLTs for those, and that approach cannot
work reliably for dynanically allocated trampolines.

Currently patch 5 returns -ENOTSUPP when long jump is detected, so no
bpf trampoline is constructed for out of range patch-site:

if (is_long_jump(orig_call, image))
return -ENOTSUPP;

Sure, my point is that in practice that means that (from the user's PoV) this
may randomly fail to work, and I'd like something that we can ensure works
consistently.


OK, should I suspend this work until you finish refactoring ftrace?

Yes; I'd appreciate if we could hold on this for a bit.

I think with some ground work we can avoid most of the painful edge cases and
might be able to avoid the need for custom trampolines.


I'v read your WIP code, but unfortunately I didn't find any mechanism to
replace bpf trampoline in your code, sorry.

It looks like bpf trampoline and ftrace works can be done at the same
time. I think for now we can just attach bpf trampoline to bpf prog.
Once your ftrace work is done, we can add support for attaching bpf
trampoline to regular kernel function. Is this OK?

Hey Mark and Xu! :)

I'm interested in this feature too and would be happy to help.

I've been trying to understand what you both have in mind to figure out a way
forward, please correct me if I got anything wrong! :)


It looks like, currently, there are three places where an indirection to BPF is
technically possible. Chronologically these are:

- the function's patchsite (currently there are 2 nops, this could become 4
nops with Mark's series on per call-site ops)

- the ftrace ops (currently called by iterating over a global list but could be
called more directly with Mark's series on per-call-site ops or by
dynamically generated branches with Wang's series on dynamic trampolines)

- a ftrace trampoline tail call (currently, this is after restoring a full
pt_regs but this could become an args only restoration with Mark's series on
DYNAMIC_FTRACE_WITH_ARGS)


If we first consider the situation when only a BPF program is attached to a
kernel function:
- Using the patchsite for indirection (proposed by Xu, same as on x86)
Pros:
- We have BPF trampolines anyway because they are required for orthogonal
features such as calling BPF programs as functions, so jumping into that
existing JITed code is straightforward
- This has the minimum overhead (eg: these trampolines only save the actual
number of args used by the function in ctx and avoid indirect calls)
Cons:
- If the BPF trampoline is JITed outside BL's limits, attachment can
randomly fail

- Using a ftrace op for indirection (proposed by Mark)
Pros:
- BPF doesn't need to care about BL's range, ftrace_caller will be in range
Cons:
- The ftrace trampoline would first save all args in an ftrace_regs only for
the BPF op to then re-save them in a BPF ctx array (as per BPF calling
convention) so we'd effectively have to do the work of saving args twice
- BPF currently uses DYNAMIC_FTRACE_WITH_DIRECT_CALLS APIs. Either arm64
should implement DIRECT_CALLS with... an indirect call :) (that is, the
arch_ftrace_set_direct_caller op would turn back its ftrace_regs into
arguments for the BPF trampoline) or BPF would need to use a different
ftrace API just on arm64 (to define new ops, which, unless if they would be
dynamically JITed, wouldn't be as performant as the existing BPF
trampolines)

- Using a ftrace trampoline tail call for indirection (not discussed yet iiuc)
Pros:
- BPF also doesn't need to care about BL's range
- This also leverages the existing BPF trampolines
Cons:
- This also does the work of saving/restoring arguments twice
- DYNAMIC_FTRACE_WITH_DIRECT_CALLS depends on DYNAMIC_FTRACE_WITH_REGS now
although in practice the registers kept by DYNAMIC_FTRACE_WITH_ARGS
should be enough to call BPF trampolines

If we consider the situation when both ftrace ops and BPF programs are attached
to a kernel function:
- Using the patchsite for indirection can't solve this

- Using a ftrace op for indirection (proposed by Mark) or using a ftrace
trampoline tail call as an indirection (proposed by Xu, same as on x86) have
the same pros & cons as in the BPF only situation except that this time we
pay the cost of registers saving twice for good reasons (we need args in both
ftrace_regs and the BPF ctx array formats anyway)


Unless I'm missing something, it sounds like the following approach would work:
- Always patch patchsites with calls to ftrace trampolines (within BL ranges)
- Always go through ops and have arch_ftrace_set_direct_caller set
ftrace_regs->direct_call (instead of pt_regs->orig_x0 in this patch)
- If ftrace_regs->direct_call != 0 at the end of the ftrace trampoline, tail
call it

Once Mark's series on DYNAMIC_FTRACE_WITH_ARGS is merged, we would need to have
DYNAMIC_FTRACE_WITH_DIRECT_CALLS
depend on DYNAMIC_FTRACE_WITH_REGS || DYNAMIC_FTRACE_WITH_ARGS
BPF trampolines (the only users of this API now) only care about args to the
attachment point anyway so I think this would work transparently ?

Once Mark's series on per-callsite ops is merged, the second step (going
through ops) would be significantly faster in the situation where only one
program is used, therefore one arch_ftrace_set_direct_caller op.

Once Wang's series on dynamic trampolines is merged, the second step (going
through ops) would also be significantly faster in the case when multiple ops
are attached.


What are your thoughts? If this sounds somewhat sane, I'm happy to help out
with the implementation as well :)


Hi Florent,

I'm struggling with how to attach bpf trampoline to regular kernel functions. I
think your suggestion is fine. Thanks for the help!

Thanks!
Florent
.