Re: [PATCH -next V7 0/7] riscv: Optimize function trace

From: Guo Ren
Date: Tue Feb 07 2023 - 21:31:24 EST


Hi Mark,

Thanks for the thoughtful reply.

On Tue, Feb 7, 2023 at 5:17 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> On Tue, Feb 07, 2023 at 11:57:06AM +0800, Guo Ren wrote:
> > On Mon, Feb 6, 2023 at 5:56 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > > The DYNAMIC_FTRACE_WITH_CALL_OPS patches should be in v6.3. They're currently
> > > queued in the arm64 tree in the for-next/ftrace branch:
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/ftrace
> > > https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/
> > >
> > > ... and those *should* be in v6.3.
> > Glad to hear that. Great!
> >
> > >
> > > Patches to implement DIRECT_CALLS atop that are in review at the moment:
> > >
> > > https://lore.kernel.org/linux-arm-kernel/20230201163420.1579014-1-revest@xxxxxxxxxxxx/
> > Good reference. Thx for sharing.
> >
> > >
> > > ... and if riscv uses the CALL_OPS approach, I believe it can do much the same
> > > there.
> > >
> > > If riscv wants to do a single atomic patch to each patch-site (to avoid
> > > stop_machine()), then direct calls would always need to bounce through the
> > > ftrace_caller trampoline (and acquire the direct call from the ftrace_ops), but
> > > that might not be as bad as it sounds -- from benchmarking on arm64, the bulk
> > > of the overhead seen with direct calls is when using the list_ops or having to
> > > do a hash lookup, and both of those are avoided with the CALL_OPS approach.
> > > Calling directly from the patch-site is a minor optimization relative to
> > > skipping that work.
> > Yes, CALL_OPS could solve the PREEMPTION & stop_machine problems. I
> > will follow up.
> >
> > The difference from arm64 is that RISC-V is a mixed 16-bit/32-bit
> > instruction ISA, so we must keep ftrace_caller & ftrace_regs_caller
> > 2048 aligned. Then:
>
> Where does the 2048-bit alignment requirement come from?
Sorry for the typo. It's 2048 bytes: the two trampolines
(ftrace_caller & ftrace_regs_caller) are kept in one 2048-byte aligned
region, because jalr only has a 12-bit signed offset (-2048..+2047
bytes).

Then the "auipc t1, ftrace(_regs)_caller" is fixed, and only the jalr
offset differs between the two trampolines.
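
For reference, the range limit comes from how a pc-relative offset is
split across the auipc/jalr pair. A minimal C sketch of that split (the
struct and function names are hypothetical and only for illustration,
this is not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two immediates of an auipc+jalr pair. */
struct pcrel_split {
    int32_t hi20;   /* auipc immediate, in units of 4096 bytes */
    int32_t lo12;   /* jalr immediate, signed 12-bit */
};

/* Split a pc-relative byte offset into the auipc and jalr immediates. */
static struct pcrel_split split_pcrel(int32_t offset)
{
    struct pcrel_split s;

    /* Sign-extend the low 12 bits of the offset. */
    s.lo12 = (int32_t)(((uint32_t)offset & 0xfffu) ^ 0x800u) - 0x800;
    /* The remainder is an exact multiple of 4096 and goes into auipc. */
    s.hi20 = (int32_t)(((int64_t)offset - s.lo12) / 4096);

    /* jalr can only encode this range, hence the 2048-byte limit. */
    assert(s.lo12 >= -2048 && s.lo12 <= 2047);
    return s;
}
```

For example, an offset of 0x12800 splits into hi20 = 0x13 and
lo12 = -2048, since 0x13000 - 2048 = 0x12800.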

>
> Note that I'm assuming you will *always* go through a common ftrace_caller
> trampoline (even for direct calls), with the trampoline responsible for
> recovering the direct trampoline (or ops->func) from the ops pointer.
>
> That would only require 64-bit alignment on 64-bit (or 32-bit alignment on
> 32-bit) to keep the literal naturally-aligned; the rest of the instructions
> wouldn't require additional alignment.
>
> For example, I would expect that (for 64-bit) you'd use:
>
> # place 2 NOPs *immediately before* the function, and 3 NOPs at the start
> -fpatchable-function-entry=5,2
>
> # Align the function to 8 bytes
> -falign-functions=8
>
> ... and your trampoline in each function could be initialized to:
>
> # Note: aligned to 8 bytes
> addr-08 // Literal (first 32-bits) // set to ftrace_nop_ops
> addr-04 // Literal (last 32-bits) // set to ftrace_nop_ops
> addr+00 func: mv t0, ra
> addr+04 auipc t1, ftrace_caller
> addr+08 nop
>
> ... and when enabled can be set to:
>
> # Note: aligned to 8 bytes
> addr-08 // Literal (first 32-bits) // patched to ops ptr
> addr-04 // Literal (last 32-bits) // patched to ops ptr
> addr+00 func: mv t0, ra
We don't need "mv t0, ra" here, because our "jalr" can link through t0
and won't affect ra. Let's handle that in the trampoline code; then we
save another word here.
> addr+04 auipc t1, ftrace_caller
> addr+08 jalr ftrace_caller(t1)

Here is the call-site:
# Note: aligned to 8 bytes
addr-08 // Literal (first 32-bits) // patched to ops ptr
addr-04 // Literal (last 32-bits) // patched to ops ptr
addr+00 auipc t0, ftrace_caller
addr+04 jalr ftrace_caller(t0)

>
> Note: this *only* requires patching the literal and NOP<->JALR; the MV and
> AUIPC aren't harmful and can always be there. This way, you won't need to use
> stop_machine().
Yes, a plain nop is better than c.j. I was confused.

>
> With that, the ftrace_caller trampoline can recover the `ops` pointer at a
> negative offset from `ra`, and can recover the instrumented function's return
> address in `t0`. Using the `ops` pointer, it can figure out whether to branch
> to a direct trampoline or whether to save/restore the regs around invoking
> ops->func.
>
> For 32-bit it would be exactly the same, except you'd only need a single nop
> before the function, and the offset would be -0x10.
Yes, on 32-bit we save another 4 bytes and need a smaller alignment,
which is better for code size:
# Note: aligned to 4 bytes
addr-04 // Literal (32-bits) // patched to ops ptr
addr+00 auipc t0, ftrace_caller
addr+04 jalr ftrace_caller(t0)
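
To make the lookup concrete: with the two-instruction call-sites above,
and assuming the jalr links into t0 so that t0 holds the address just
past the jalr, the trampoline would find the literal at t0 - 16 on
64-bit and at t0 - 12 on 32-bit. A rough C model of that (all names are
hypothetical; it only simulates the layout in a byte buffer):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constants describing the call-site sketched above. */
enum {
    INSN_SIZE       = 4,  /* auipc and jalr are 32-bit instructions */
    CALL_SITE_INSNS = 2   /* auipc + jalr */
};

/* Byte offset from the link address (just past jalr) to the literal. */
static int ops_offset(unsigned ptr_size)
{
    return -(int)(CALL_SITE_INSNS * INSN_SIZE + ptr_size);
}

/* Read the ops literal back from a simulated patch-site. */
static uint64_t recover_ops(const uint8_t *link_addr, unsigned ptr_size)
{
    const uint8_t *literal = link_addr + ops_offset(ptr_size);

    if (ptr_size == 4) {
        uint32_t v32;
        memcpy(&v32, literal, sizeof(v32));
        return v32;
    }

    uint64_t v64;
    memcpy(&v64, literal, sizeof(v64));
    return v64;
}
```

With Mark's three-instruction layout (mv + auipc + jalr) the same
arithmetic gives -0x10 for 32-bit, matching the offset he mentions.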
>
> That's what arm64 does; the only difference is that riscv would *always* need
> to go via the trampoline in order to make direct calls.
We need one more trampoline here besides ftrace_caller &
ftrace_regs_caller: "direct_caller".

addr+04 nop -> direct_caller/ftrace_caller/ftrace_regs_caller

>
> Thanks,
> Mark.



--
Best Regards
Guo Ren