Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

From: Andy Chiu
Date: Fri Mar 08 2024 - 04:19:51 EST


Hi Puranjay,

On Fri, Mar 8, 2024 at 3:53 AM Puranjay Mohan <puranjay12@xxxxxxxxx> wrote:
>
> Hi Björn,
>
> On Thu, Mar 7, 2024 at 8:27 PM Björn Töpel <bjorn@xxxxxxxxxx> wrote:
> >
> > Puranjay!
> >
> > Puranjay Mohan <puranjay12@xxxxxxxxx> writes:
> >
> > > This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V.
> > > This allows each ftrace callsite to provide an ftrace_ops to the common
> > > ftrace trampoline, allowing each callsite to invoke distinct tracer
> > > functions without the need to fall back to list processing or to
> > > allocate custom trampolines for each callsite. This significantly speeds
> > > up cases where multiple distinct trace functions are used and callsites
> > > are mostly traced by a single tracer.
> > >
> > > The idea and most of the implementation is taken from the ARM64's
> > > implementation of the same feature. The idea is to place a pointer to
> > > the ftrace_ops as a literal at a fixed offset from the function entry
> > > point, which can be recovered by the common ftrace trampoline.
> >
> > Not really a review, but some more background; Another rationale (on-top
> > of the improved per-call performance!) for CALL_OPS was to use it to
> > build ftrace direct call support (which BPF uses a lot!). Mark, please
> > correct me if I'm lying here!
> >
> > On Arm64, CALL_OPS makes it possible to implement direct calls, while
> > only patching one BL instruction -- nice!
> >
> > On RISC-V we cannot use use the same ideas as Arm64 straight off,
> > because the range of jal (compare to BL) is simply too short (+/-1M).
> > So, on RISC-V we need to use a full auipc/jal pair (the text patching
> > story is another chapter, but let's leave that aside for now). Since we
> > have to patch multiple instructions, the cmodx situation doesn't really
> > improve with CALL_OPS.
> >
> > Let's say that we continue building on your patch and implement direct
> > calls on CALL_OPS for RISC-V as well.
> >
> > From Florent's commit message for direct calls:
> >
> > | There are a few cases to distinguish:
> > | - If a direct call ops is the only one tracing a function:
> > | - If the direct called trampoline is within the reach of a BL
> > | instruction
> > | -> the ftrace patchsite jumps to the trampoline
> > | - Else
> > | -> the ftrace patchsite jumps to the ftrace_caller trampoline which
> > | reads the ops pointer in the patchsite and jumps to the direct
> > | call address stored in the ops
> > | - Else
> > | -> the ftrace patchsite jumps to the ftrace_caller trampoline and its
> > | ops literal points to ftrace_list_ops so it iterates over all
> > | registered ftrace ops, including the direct call ops and calls its
> > | call_direct_funcs handler which stores the direct called
> > | trampoline's address in the ftrace_regs and the ftrace_caller
> > | trampoline will return to that address instead of returning to the
> > | traced function
> >
> > On RISC-V, where auipc/jalr is used, the direct called trampoline would
> > always be reachable, and then first Else-clause would never be entered.
> > This means the the performance for direct calls would be the same as the
> > one we have today (i.e. no regression!).
> >
> > RISC-V does like x86 does (-ish) -- patch multiple instructions, long
> > reach.
> >
> > Arm64 uses CALL_OPS and patch one instruction BL.
> >
> > Now, with this background in mind, compared to what we have today,
> > CALL_OPS would give us (again assuming we're using it for direct calls):
> >
> > * Better performance for tracer per-call (faster ops lookup) GOOD
>
> ^ this was the only motivation for me to implement this patch.
>
> I don't think implementing direct calls over call ops is fruitful for
> RISC-V because once
> the auipc/jalr can be patched atomically, the direct call trampoline
> is always reachable.

Yes, the auipc/jalr instruction pair can be patched atomically just as
long as their size is naturally aligned on. However, we cannot prevent
2 instructions from being fetched atomically :P

> Solving the atomic text patching problem would be fun!! I am eager to
> see how it will be
> solved.

I have a patch series to solve the atomic code patching issue, which I
am about to respin [1]. The idea is to solve it with yet another layer
of indirection. We add a 8-B aligned space at each function entry. The
space is a pointer to the ftrace entry. During boot, each function
entry code is updated to perform a load and then take the jump from
the 8-B space. When ftrace is disabled, we patch the first 4B-aligned
instruction to a jump so as to skip the ftrace entry.

We are still discussing with Alex to see if we have a better way to do
it. If not then I'd update the patchset and re-send it. There's a
pending improvement in the series to reduce complexity. The 8-B
aligned space can be added before the function entry (just like your
patch).

>
> > * Larger text size (function alignment + extra nops) BAD
> > * Same direct call performance NEUTRAL
> > * Same complicated text patching required NEUTRAL
> >
> > It would be interesting to see how the per-call performance would
> > improve on x86 with CALL_OPS! ;-)
>
> If I remember from Steven's talk, x86 uses dynamically allocated trampolines
> for per callsite tracers, would CALL_OPS provide better performance than that?
>
> >
> > I'm trying to wrap my head if it makes sense to have it on RISC-V, given
> > that we're a bit different from Arm64. Does the scale tip to the GOOD
> > side?
> >
> > Oh, and we really need to see performance numbers on real HW! I have a
> > VF2 that I could try this series on.
>
> It would be great if you can do it :D.
>
> Thanks,
> Puranjay
>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@xxxxxxxxxxxxxxxxxxx
> http://lists.infradead.org/mailman/listinfo/linux-riscv

- [1] https://yhbt.net/lore/all/CAJF2gTSn3_cDYsF9D8dt-br2Wf_M8y02A09xgRq8kXi91sN3Aw@xxxxxxxxxxxxxx/T/

Regards,
Andy