Re: [PATCH bpf-next v5 2/3] bpf, x86: allow function arguments up to 12 for TRACING

From: Menglong Dong
Date: Sun Jun 18 2023 - 22:32:04 EST


On Mon, Jun 19, 2023 at 7:11 AM Yonghong Song <yhs@xxxxxxxx> wrote:
>
>
>
> On 6/12/23 7:52 PM, menglong8.dong@xxxxxxxxx wrote:
> > From: Menglong Dong <imagedong@xxxxxxxxxxx>
> >
> > For now, the BPF program of type BPF_PROG_TYPE_TRACING can only be used
> > on the kernel functions whose arguments count less than 6. This is not
> > friendly at all, as too many functions have arguments count more than 6.
> >
> > According to the current kernel version, below is a statistics of the
> > function arguments count:
> >
> > argument count | function count
> > 7 | 704
> > 8 | 270
> > 9 | 84
> > 10 | 47
> > 11 | 47
> > 12 | 27
> > 13 | 22
> > 14 | 5
> > 15 | 0
> > 16 | 1
> >
> > Therefore, let's enhance it by increasing the function arguments count
> > allowed in arch_prepare_bpf_trampoline(), for now, only x86_64.
> >
> > For the case that we don't need to call origin function, which means
> > without BPF_TRAMP_F_CALL_ORIG, we need only copy the function arguments
> > that stored in the frame of the caller to current frame. The arguments
> > of arg6-argN are stored in "$rbp + 0x18", we need copy them to
> > "$rbp - regs_off + (6 * 8)".
> >
> > For the case with BPF_TRAMP_F_CALL_ORIG, we need prepare the arguments
> > in stack before call origin function, which means we need alloc extra
> > "8 * (arg_count - 6)" memory in the top of the stack. Note, there should
> > not be any data be pushed to the stack before call the origin function.
> > Then, we have to store rbx with 'mov' instead of 'push'.
>
> Please also mention special case related to 16-byte struct argument
> in the comments of save_args().
>
> >
> > We use EMIT3_off32() or EMIT4() for "lea" and "sub". The range of the
> > imm in "lea" and "sub" is [-128, 127] if EMIT4() is used. Therefore,
> > we use EMIT3_off32() instead if the imm out of the range.
> >
> > It works well for the FENTRY/FEXIT/MODIFY_RETURN.
> >
> > Signed-off-by: Menglong Dong <imagedong@xxxxxxxxxxx>
> > ---
> > v5:
> > - consider the case of the struct in arguments can't be hold by regs
> > v4:
> > - make the stack 16-byte aligned if passing args on-stack is needed
> > - add the function arguments statistics to the commit log
> > v3:
> > - use EMIT3_off32() for "lea" and "sub" only on necessary
> > - make 12 as the maximum arguments count
> > v2:
> > - instead EMIT4 with EMIT3_off32 for "lea" to prevent overflow
> > - make MAX_BPF_FUNC_ARGS as the maximum argument count
> > ---
> > arch/x86/net/bpf_jit_comp.c | 221 +++++++++++++++++++++++++++++++-----
> > 1 file changed, 195 insertions(+), 26 deletions(-)
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index a407fbbffecd..47c699594dd8 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -1857,37 +1857,165 @@ st: if (is_imm8(insn->off))
> > return proglen;
> > }
> >
> > -static void save_regs(const struct btf_func_model *m, u8 **prog, int nr_regs,
> > - int stack_size)
> > +static inline void clean_stack_garbage(const struct btf_func_model *m,
> > + u8 **pprog, int nr_args_on_stack,
>
> nr_args_on_stack is actually nr_stack_slots, right? Maybe rename to
> nr_stack_slots?
>
> > + int stack_size)
> > {
> > - int i;
> > + int arg_size, off;
> > + u8 *prog;
> > +
> > + if (nr_args_on_stack != 1)
> > + return;
> > +
> > + /* the size of the last argument */
> > + arg_size = m->arg_size[m->nr_args - 1];
> > +
> > + /* Generally speaking, the compiler will pass the arguments
> > + * on-stack with "push" instruction, which will take 8-byte
> > + * on the stack. On this case, there won't be garbage values
>
> On this case -> In this case. The same for below another case.
>
> > + * while we copy the arguments from origin stack frame to current
> > + * in BPF_DW.
> > + *
> > + * However, sometimes the compiler will only allocate 4-byte on
> > + * the stack for the arguments. For now, this case will only
> > + * happen if there is only one argument on-stack and its size
> > + * not more than 4 byte. On this case, there will be garbage
> > + * values on the upper 4-byte where we store the argument on
> > + * current stack frame.
> > + *
> > + * arguments on origin stack:
> > + *
> > + * stack_arg_1(4-byte) xxx(4-byte)
> > + *
> > + * what we copy:
> > + *
> > + * stack_arg_1(8-byte): stack_arg_1(origin) xxx
> > + *
> > + * and the xxx is the garbage values which we should clean here.
> > + */
>
> let us put the above comments before
> > + if (nr_args_on_stack != 1)
> > + return;
>
>
> > + if (arg_size <= 4) {
> > + off = -(stack_size - 4);
> > + prog = *pprog;
> > + /* mov DWORD PTR [rbp + off], 0 */
> > + if (!is_imm8(off))
> > + EMIT2_off32(0xC7, 0x85, off);
> > + else
> > + EMIT3(0xC7, 0x45, off);
> > + EMIT(0, 4);
> > + *pprog = prog;
> > + }
> > +}
> > +
> > +static void save_args(const struct btf_func_model *m, u8 **prog,
> > + int stack_size, bool on_stack)
>
> Rename 'on_stack' to 'for_call_origin'? This should be more
> clear about the use case.
>
> > +{
> > + int arg_regs, first_off, nr_regs = 0, nr_stack = 0;
> nr_stack -> nr_stack_slots?
> > + int i, j;
> >
> > /* Store function arguments to stack.
> > * For a function that accepts two pointers the sequence will be:
> > * mov QWORD PTR [rbp-0x10],rdi
> > * mov QWORD PTR [rbp-0x8],rsi
> > */
> > - for (i = 0; i < min(nr_regs, 6); i++)
> > - emit_stx(prog, BPF_DW, BPF_REG_FP,
> > - i == 5 ? X86_REG_R9 : BPF_REG_1 + i,
> > - -(stack_size - i * 8));
> > + for (i = 0; i < min_t(int, m->nr_args, MAX_BPF_FUNC_ARGS); i++) {
> > + arg_regs = (m->arg_size[i] + 7) / 8;
> > +
> > + /* According to the research of Yonghong, struct members
> > + * should be all in register or all on the stack.
> > + * Meanwhile, the compiler will pass the argument on regs
> > + * if the remained regs can hold the argument.
> remained -> remaining
> > + *
> > + * Disorder of the args can happen. For example:
> > + *
> > + * struct foo_struct {
> > + * long a;
> > + * int b;
> > + * };
> > + * int foo(char, char, char, char, char, struct foo_struct,
> > + * char);
> > + *
> > + * the arg1-5,arg7 will be passed by regs, and arg6 will
> > + * by stack.
> > + *
> > + * Therefore, we should keep the same logic as here when
> > + * we restore the regs in restore_regs.
> > + */
> > + if (nr_regs + arg_regs > 6) {
> > + /* copy function arguments from origin stack frame
> > + * into current stack frame.
> > + *
> > + * The starting address of the arguments on-stack
> > + * is:
> > + * rbp + 8(push rbp) +
> > + * 8(return addr of origin call) +
> > + * 8(return addr of the caller)
> > + * which means: rbp + 24
> > + */
> > + for (j = 0; j < arg_regs; j++) {
> > + emit_ldx(prog, BPF_DW, BPF_REG_0, BPF_REG_FP,
> > + nr_stack * 8 + 0x18);
> > + emit_stx(prog, BPF_DW, BPF_REG_FP, BPF_REG_0,
> > + -stack_size);
> > +
> > + if (!nr_stack)
> > + first_off = stack_size;
> > + stack_size -= 8;
> > + nr_stack++;
> > + }
> > + } else {
> > + /* Only copy the arguments on-stack to current
> > + * 'stack_size' and ignore the regs, used to
> > + * prepare the arguments on-stack for orign call.
> > + */
> > + if (on_stack) {
> > + nr_regs += arg_regs;
> > + continue;
> > + }
> > +
> > + /* copy the arguments from regs into stack */
> > + for (j = 0; j < arg_regs; j++) {
> > + emit_stx(prog, BPF_DW, BPF_REG_FP,
> > + nr_regs == 5 ? X86_REG_R9 : BPF_REG_1 + i,
> > + -stack_size);
> > + stack_size -= 8;
> > + nr_regs++;
> > + }
> > + }
> > + }
> > +
> > + clean_stack_garbage(m, prog, nr_stack, first_off);
> > }
> >
> [...]
> > /* Generated trampoline stack layout:
> > @@ -2147,7 +2286,14 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
> > *
> > * RBP - ip_off [ traced function ] BPF_TRAMP_F_IP_ARG flag
> > *
> > + * RBP - rbx_off [ rbx value ] always
> > + *
> > * RBP - run_ctx_off [ bpf_tramp_run_ctx ]
> > + *
> > + * [ stack_argN ] BPF_TRAMP_F_CALL_ORIG
> > + * [ ... ]
> > + * [ stack_arg2 ]
> > + * RBP - arg_stack_off [ stack_arg1 ]
> > */
> >
> > /* room for return value of orig_call or fentry prog */
> > @@ -2167,9 +2313,25 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
> >
> > ip_off = stack_size;
> >
> > + stack_size += 8;
> > + rbx_off = stack_size;
> > +
> > stack_size += (sizeof(struct bpf_tramp_run_ctx) + 7) & ~0x7;
> > run_ctx_off = stack_size;
> >
> > + if (nr_regs > 6 && (flags & BPF_TRAMP_F_CALL_ORIG)) {
> > + stack_size += (nr_regs - 6) * 8;
>
> Please double check. Is this okay for the case below?
> foo(int, int, int, int, int, 16_byte_struct)
> here, nr_regs is 7, yes, to-be-increased stack size should be 2.
>

You are right, here should be:
stack_size += (nr_regs - nr_arg_on_regs) * 8

The test case "bpf_testmod_fentry_test_struct1" shouldn't have
passed, and I'll figure out the reason too.

Thank you for the comment above, and I'll change them in
the next version too.

>
> > + /* make sure the stack pointer is 16-byte aligned if we
> > + * need pass arguments on stack, which means
> > + * [stack_size + 8(rbp) + 8(rip) + 8(origin rip)]
> > + * should be 16-byte aligned. Following code depend on
> > + * that stack_size is already 8-byte aligned.
> > + */
> > + stack_size += (stack_size % 16) ? 0 : 8;
> > + }
> > +
> > + arg_stack_off = stack_size;
> > +
> [...]