Re: [PATCH v4 6/9] tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGS

From: Florent Revest
Date: Fri Aug 25 2023 - 12:13:15 EST


On Wed, Aug 23, 2023 at 5:16 PM Masami Hiramatsu (Google)
<mhiramat@xxxxxxxxxx> wrote:
>
> diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
> index c60d0d9f1a95..90ad28260a9f 100644
> --- a/kernel/trace/trace_fprobe.c
> +++ b/kernel/trace/trace_fprobe.c
> +#else /* CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST */
> +
> +/* Since fprobe handlers can be nested, pt_regs buffer need to be a stack */
> +#define PERF_FPROBE_REGS_MAX 4
> +
> +struct pt_regs_stack {
> + struct pt_regs regs[PERF_FPROBE_REGS_MAX];
> + int idx;
> +};
> +
> +static DEFINE_PER_CPU(struct pt_regs_stack, perf_fprobe_regs);
> +
> +static __always_inline
> +struct pt_regs *perf_fprobe_partial_regs(struct ftrace_regs *fregs)
> +{
> + struct pt_regs_stack *stack = this_cpu_ptr(&perf_fprobe_regs);
> + struct pt_regs *regs;
> +
> + if (stack->idx < PERF_FPROBE_REGS_MAX) {
> + regs = stack->regs[stack->idx++];

This is missing an &:
regs = &stack->regs[stack->idx++];

> + return ftrace_partial_regs(fregs, regs);

I think this is incorrect on arm64, and it will likely cause very
subtle failure modes on other architectures down the line too. The
problem on arm64 is that Perf calls "user_mode(regs)" somewhere down
the line, and that macro reads the "pstate" register, which is not
populated in ftrace_regs, so it is not copied into a "partial" pt_regs
either and Perf can make wrong decisions based on it.

I already mentioned this problem in the past:
- in the third answer block of:
https://lore.kernel.org/all/CABRcYmJjtVq-330ktqTAUiNO1=yG_aHd0xz=c550O5C7QP++UA@xxxxxxxxxxxxxx/
- in the fourth answer block of:
https://lore.kernel.org/all/CABRcYm+esb8J2O1v6=C+h+HSa5NxraPUgo63w7-iZj0CXbpusg@xxxxxxxxxxxxxx/

It is quite possible that other architectures will at some point
introduce a light ftrace "args" trampoline that misses one of the
registers expected by Perf, because it is easy not to realize that
this trampoline calls fprobe, which calls Perf, which has specific
register expectations.

We got the green light from Alexei to use ftrace_partial_regs for BPF
multi_kprobe because those BPF programs can gracefully deal with
sparse pt_regs, but I think a similar conversation needs to happen
with the Perf folks.

----

On a side note, a subtle difference between ftrace_partial_regs with
and without HAVE_PT_REGS_TO_FTRACE_REGS_CAST is that one does a copy
and the other does not. If a subsystem receives a partial regs under
HAVE_PT_REGS_TO_FTRACE_REGS_CAST, it can modify register fields and
the modified values will be restored by the ftrace trampoline. Without
HAVE_PT_REGS_TO_FTRACE_REGS_CAST, only the copy will be modified and
ftrace won't restore it. I think the least we can do is to document
the guarantees of the ftrace_partial_regs API thoroughly: users
shouldn't rely on modifying the resulting regs, because depending on
the architecture this can do different things, and they shouldn't rely
on any register that isn't covered by one of the ftrace_regs_get_*
helpers, because it can be unpopulated on some architectures. I
believe this is the case for BPF multi_kprobe but not for Perf.

> + }
> + return NULL;
> +}
> +
> +static __always_inline void perf_fprobe_return_regs(struct pt_regs *regs)
> +{
> + struct pt_regs_stack *stack = this_cpu_ptr(&perf_fprobe_regs);
> +
> + if (WARN_ON_ONCE(regs != stack->regs[stack->idx]))

This is missing an & too:
if (WARN_ON_ONCE(regs != &stack->regs[stack->idx]))

> + return;
> +
> + --stack->idx;
> +}
> +
> +#endif /* !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS || CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST */