Re: [PATCH bpf-next v3 3/7] bpf: Introduce task open coded iterator kfuncs

From: Andrii Nakryiko
Date: Fri Sep 29 2023 - 17:28:11 EST


On Wed, Sep 27, 2023 at 8:29 PM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
>
> Hello,
>
> 在 2023/9/28 07:20, Andrii Nakryiko 写道:
> > On Mon, Sep 25, 2023 at 3:56 AM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
> >>
> >> This patch adds kfuncs bpf_iter_task_{new,next,destroy} which allow
> >> creation and manipulation of struct bpf_iter_task in open-coded iterator
> >> style. BPF programs can use these kfuncs or through bpf_for_each macro to
> >> iterate all processes in the system.
> >>
> >> The API design keep consistent with SEC("iter/task"). bpf_iter_task_new()
> >> accepts a specific task and iterating type which allows:
> >> 1. iterating all process in the system
> >>
> >> 2. iterating all threads in the system
> >>
> >> 3. iterating all threads of a specific task
> >> Here we also resuse enum bpf_iter_task_type and rename BPF_TASK_ITER_TID
> >> to BPF_TASK_ITER_THREAD, rename BPF_TASK_ITER_TGID to BPF_TASK_ITER_PROC.
> >>
> >> The newly-added struct bpf_iter_task has a name collision with a selftest
> >> for the seq_file task iter's bpf skel, so the selftests/bpf/progs file is
> >> renamed in order to avoid the collision.
> >>
> >> Signed-off-by: Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx>
> >> ---
> >> include/linux/bpf.h | 8 +-
> >> kernel/bpf/helpers.c | 3 +
> >> kernel/bpf/task_iter.c | 96 ++++++++++++++++---
> >> .../testing/selftests/bpf/bpf_experimental.h | 5 +
> >> .../selftests/bpf/prog_tests/bpf_iter.c | 18 ++--
> >> .../{bpf_iter_task.c => bpf_iter_tasks.c} | 0
> >> 6 files changed, 106 insertions(+), 24 deletions(-)
> >> rename tools/testing/selftests/bpf/progs/{bpf_iter_task.c => bpf_iter_tasks.c} (100%)
> >>
> >
> > [...]
> >
> >> @@ -692,9 +692,9 @@ static int bpf_iter_fill_link_info(const struct bpf_iter_aux_info *aux, struct b
> >> static void bpf_iter_task_show_fdinfo(const struct bpf_iter_aux_info *aux, struct seq_file *seq)
> >> {
> >> seq_printf(seq, "task_type:\t%s\n", iter_task_type_names[aux->task.type]);
> >> - if (aux->task.type == BPF_TASK_ITER_TID)
> >> + if (aux->task.type == BPF_TASK_ITER_THREAD)
> >> seq_printf(seq, "tid:\t%u\n", aux->task.pid);
> >> - else if (aux->task.type == BPF_TASK_ITER_TGID)
> >> + else if (aux->task.type == BPF_TASK_ITER_PROC)
> >> seq_printf(seq, "pid:\t%u\n", aux->task.pid);
> >> }
> >>
> >> @@ -856,6 +856,80 @@ __bpf_kfunc void bpf_iter_css_task_destroy(struct bpf_iter_css_task *it)
> >> bpf_mem_free(&bpf_global_ma, kit->css_it);
> >> }
> >>
> >> +struct bpf_iter_task {
> >> + __u64 __opaque[2];
> >> + __u32 __opaque_int[1];
> >
> > this should be __u64 __opaque[3], because struct takes full 24 bytes
> >
> >> +} __attribute__((aligned(8)));
> >> +
> >> +struct bpf_iter_task_kern {
> >> + struct task_struct *task;
> >> + struct task_struct *pos;
> >> + unsigned int type;
> >> +} __attribute__((aligned(8)));
> >> +
> >> +__bpf_kfunc int bpf_iter_task_new(struct bpf_iter_task *it, struct task_struct *task, unsigned int type)
> >
> > nit: type -> flags, so we can add a bit more stuff, if necessary
> >
> >> +{
> >> + struct bpf_iter_task_kern *kit = (void *)it;
> >
> > empty line after variable declarations
> >
> >> + BUILD_BUG_ON(sizeof(struct bpf_iter_task_kern) != sizeof(struct bpf_iter_task));
> >> + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_kern) !=
> >> + __alignof__(struct bpf_iter_task));
> >
> > and I'd add empty line here to keep BUILD_BUG_ON block separate
> >
> >> + kit->task = kit->pos = NULL;
> >> + switch (type) {
> >> + case BPF_TASK_ITER_ALL:
> >> + case BPF_TASK_ITER_PROC:
> >> + case BPF_TASK_ITER_THREAD:
> >> + break;
> >> + default:
> >> + return -EINVAL;
> >> + }
> >> +
> >> + if (type == BPF_TASK_ITER_THREAD)
> >> + kit->task = task;
> >> + else
> >> + kit->task = &init_task;
> >> + kit->pos = kit->task;
> >> + kit->type = type;
> >> + return 0;
> >> +}
> >> +
> >> +__bpf_kfunc struct task_struct *bpf_iter_task_next(struct bpf_iter_task *it)
> >> +{
> >> + struct bpf_iter_task_kern *kit = (void *)it;
> >> + struct task_struct *pos;
> >> + unsigned int type;
> >> +
> >> + type = kit->type;
> >> + pos = kit->pos;
> >> +
> >> + if (!pos)
> >> + goto out;
> >> +
> >> + if (type == BPF_TASK_ITER_PROC)
> >> + goto get_next_task;
> >> +
> >> + kit->pos = next_thread(kit->pos);
> >> + if (kit->pos == kit->task) {
> >> + if (type == BPF_TASK_ITER_THREAD) {
> >> + kit->pos = NULL;
> >> + goto out;
> >> + }
> >> + } else
> >> + goto out;
> >> +
> >> +get_next_task:
> >> + kit->pos = next_task(kit->pos);
> >> + kit->task = kit->pos;
> >> + if (kit->pos == &init_task)
> >> + kit->pos = NULL;
> >
> > I can't say I completely follow the logic (e.g., for
> > BPF_TASK_ITER_PROC, why do we do next_task() on first next() call)?
> > Can you elabore the expected behavior for various combinations of
> > types and starting task argument?
> >
>
> Thanks for the review.
>
> The expected behavior of current implementation is:
>
> BPF_TASK_ITER_PROC:
>
> init_task->first_process->second_process->...->last_process->init_task
>
> We would exit before visiting init_task again.

ah, ok, so in this case it's more like BPF_TASK_ITER_ALL_PROCS, i.e.,
we iterate all processes in the system. Input `task` that we provide
is ignored/meaningless, right? Maybe we should express it as
ALL_PROCS?

>
> BPF_TASK_ITER_THREAD:
>
> group_task->first_thread->second_thread->...->last_thread->group_task
>
> We would exit before visiting group_task again.
>

And this one is iterating threads of a process specified by given
`task`, right? This is where my confusion comes from. ITER_PROC and
ITER_THREAD, by their name, seems to be very similar, but in reality
ITER_PROC is more like ITER_ALL (except process vs thread iteration),
while ITER_THREAD is parameterized by input `task`.

I'm not sure what's the least confusing way to name and organize
everything, but I think it's quite confusing right now, unfortunately.
I wonder if you or someone else have a better suggestion on making
this more straightforward?


> BPF_TASK_ITER_ALL:
>
> init_task -> first_process -> second_process -> ...
> | |
> -> first_thread.. |
> -> first_thread
>

Ok, and this one is "all threads in the system". So
BPF_TASK_ITER_ALL_THREADS would describe it more precisely, right?

> Actually, every next() call, we would return the "pos" which was
> prepared by previous next() call, and use next_task()/next_thread() to
> update kit->pos. Once we meet the exit condition (next_task() return
> init_task or next_thread() return group_task), we would update kit->pos
> to NULL. In this way, when next() is called again, we will terminate the
> iteration.
>
> Here "kit->pos = NULL;" means we would return the last valid "pos" and
> will return NULL in next call to exit from the iteration.
>
> Am I miss something important?

No, it's my bad. I did check, but somehow concluded that you are
returning kit->pos, but you are returning locally cached previous
value of kit->pos. All good here, I think.

>
> Thanks.
>
>
>