Re: [RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line

From: Christophe Leroy
Date: Thu Dec 21 2023 - 05:46:23 EST




Le 08/12/2023 à 17:30, Naveen N Rao a écrit :
> Function profile sequence on powerpc includes two instructions at the
> beginning of each function:
>
> mflr r0
> bl ftrace_caller
>
> The call to ftrace_caller() gets nop'ed out during kernel boot and is
> patched in when ftrace is enabled.
>
> There are two issues with this:
> 1. The 'mflr r0' instruction at the beginning of each function remains
> even though ftrace is not being used.
> 2. When ftrace is activated, we return from ftrace_caller() with a bctr
> instruction to preserve r0 and LR, resulting in the link stack
> becoming unbalanced.
>
> To address (1), we have tried to nop'out the 'mflr r0' instruction when
> nop'ing out the call to ftrace_caller() and restoring it when enabling
> ftrace. But, that required additional synchronization slowing down
> ftrace activation. It also left an additional nop instruction at the
> beginning of each function and that wasn't desirable on 32-bit powerpc.
>
> Instead of that, move the function profile sequence out-of-line leaving
> a single nop at function entry. On ftrace activation, the nop is changed
> to an unconditional branch to the out-of-line sequence that in turn
> calls ftrace_caller(). This removes the need for complex synchronization
> during ftrace activation and simplifies the code. More importantly, this
> improves performance of the kernel when ftrace is not in use.
>
> To address (2), change the ftrace trampoline to return with a 'blr'
> instruction with the original return address in r0 intact. Then, an
> additional 'mtlr r0' instruction in the function profile sequence can
> move the correct return address back to LR.
>
> With the above two changes, the function profile sequence now looks like
> the following:
>
> [func: # GEP -- 64-bit powerpc, optional
> addis r2,r12,imm1
> addi r2,r2,imm2]
> tramp:
> mflr r0
> bl ftrace_caller
> mtlr r0
> b func
> nop
> [nop] # 64-bit powerpc only
> func: # LEP
> nop
>
> On 32-bit powerpc, the ftrace mcount trampoline is now completely
> outside the function. This is also the case on 64-bit powerpc for
> functions that do not need a GEP. However, for functions that need a
> GEP, the additional instructions are inserted between the GEP and the
> LEP. Since we can only have a fixed number of instructions between GEP
> and LEP, we choose to emit 6 instructions. Four of those instructions
> are used for the function profile sequence and two instruction slots are
> reserved for implementing support for DYNAMIC_FTRACE_WITH_CALL_OPS. On
> 32-bit powerpc, we emit one additional nop for this purpose resulting in
> a total of 5 nops before function entry.
>
> To enable ftrace, the nop at function entry is changed to an
> unconditional branch to 'tramp'. The call to ftrace_caller() may be
> updated to ftrace_regs_caller() depending on the registered ftrace ops.
> On 64-bit powerpc, we additionally change the instruction at 'tramp' to
> 'mflr r0' from an unconditional branch back to func+4. This is so that
> functions entered through the GEP can skip the function profile sequence
> unless ftrace is enabled.
>
> With the context_switch microbenchmark on a P9 machine, there is a
> performance improvement of ~6% with this patch applied, going from 650k
> context switches to 690k context switches without ftrace enabled. With
> ftrace enabled, the performance was similar at 86k context switches.

Wondering how significant that context_switch micorbenchmark is.

I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the
results:
# ./context_switch --no-fp
Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no
vdso:no

On 885, I get the following results before and after your patch.

CONFIG_FTRACE not selected : 44,9k
CONFIG_FTRACE selected, before : 32,8k
CONFIG_FTRACE selected, after : 33,6k

All this is with CONFIG_INIT_STACK_ALL_ZERO which is the default. But
when I select CONFIG_INIT_STACK_NONE, the CONFIG_FTRACE not selected
result is only 34,4.

On 8321:

CONFIG_FTRACE not selected : 100,3k
CONFIG_FTRACE selected, before : 72,5k
CONFIG_FTRACE selected, after : 116k

So the results look odd to me.

>
> The downside of this approach is the increase in vmlinux size,
> especially on 32-bit powerpc. We now emit 3 additional instructions for
> each function (excluding the one or two instructions for supporting
> DYNAMIC_FTRACE_WITH_CALL_OPS). On 64-bit powerpc with the current
> implementation of -fpatchable-function-entry though, this is not
> avoidable since we are forced to emit 6 instructions between the GEP and
> the LEP even if we are to only support DYNAMIC_FTRACE_WITH_CALL_OPS.

The increase is almost 5% on the few 32 bits defconfig I have tested.
That's a lot.

On 32 bits powerpc, only the e500 has a link stack that could end up
being unbalanced. Could we keep the bctr and avoid the mtlr and the jump
in the trampoline ?

On 8xx a NOP is one cycle, a taken branch is 2 cycles, but the second
cycle is a bubble that most of the time gets filled by following
operations. On the 83xx, branches and NOPs are supposed to be seamless.

So is that out-of-line trampoline really worth it ? Maybe keeping the
ftrace instructions at the begining and just replacing the mflr by an
jump when ftrace is off would help reduce the ftrace insns by one more
instruction.

>
> Signed-off-by: Naveen N Rao <naveen@xxxxxxxxxx>
> ---

> diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
> index 84f6ccd7de7a..9a54bb9e0dde 100644
> --- a/arch/powerpc/include/asm/code-patching.h
> +++ b/arch/powerpc/include/asm/code-patching.h
> @@ -185,10 +185,21 @@ static inline unsigned long ppc_function_entry(void *func)
> */
> if ((((*insn & OP_RT_RA_MASK) == ADDIS_R2_R12) ||
> ((*insn & OP_RT_RA_MASK) == LIS_R2)) &&
> - ((*(insn+1) & OP_RT_RA_MASK) == ADDI_R2_R2))
> + ((*(insn+1) & OP_RT_RA_MASK) == ADDI_R2_R2)) {
> +#ifdef CONFIG_ARCH_USING_PATCHABLE_FUNCTION_ENTRY

Can you replace by IS_ENABLED() ?

> + /*
> + * Heuristic: look for the 'mtlr r0' instruction assuming ftrace init is done.
> + * If it is not found, look for two consecutive nops after the GEP.
> + * Longer term, we really should be parsing the symbol table to determine LEP.
> + */
> + if ((*(insn+4) == PPC_RAW_MTLR(_R0)) ||
> + ((*(insn+2) == PPC_RAW_NOP() && *(insn+3) == PPC_RAW_NOP())))
> + return (unsigned long)(insn + 8);
> +#endif
> return (unsigned long)(insn + 2);
> - else
> + } else {
> return (unsigned long)func;
> + }
> #elif defined(CONFIG_PPC64_ELF_ABI_V1)
> /*
> * On PPC64 ABIv1 the function pointer actually points to the

> diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
> index 2956196c98ff..d3b4949142a8 100644
> --- a/arch/powerpc/kernel/trace/ftrace.c
> +++ b/arch/powerpc/kernel/trace/ftrace.c

> @@ -217,15 +274,62 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
> {
> unsigned long addr, ip = rec->ip;
> ppc_inst_t old, new;
> - int ret = 0;
> + int i, ret = 0;
> + u32 ftrace_mcount_tramp_insns[] = {
> +#ifdef CONFIG_PPC64
> + PPC_RAW_BRANCH(FTRACE_MCOUNT_TRAMP_OFFSET + MCOUNT_INSN_SIZE),
> +#else
> + PPC_RAW_MFLR(_R0),
> +#endif

Can it be based on IS_ENABLED(CONFIG_PPC64) instead ?

> + PPC_RAW_BL(0), /* bl ftrace_caller */
> + PPC_RAW_MTLR(_R0), /* also see update ppc_function_entry() */
> + PPC_RAW_BRANCH(FTRACE_MCOUNT_TRAMP_OFFSET - MCOUNT_INSN_SIZE * 2)
> + };
> +