Re: [PATCH bpf-next v2 3/6] bpf: Introduce process open coded iterator kfuncs

From: Chuyi Zhou
Date: Sat Sep 16 2023 - 10:04:55 EST


Hello.

在 2023/9/16 04:37, Andrii Nakryiko 写道:
On Fri, Sep 15, 2023 at 8:03 AM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
在 2023/9/15 07:26, Andrii Nakryiko 写道:
On Tue, Sep 12, 2023 at 12:02 AM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:

This patch adds kfuncs bpf_iter_process_{new,next,destroy} which allow
creation and manipulation of struct bpf_iter_process in open-coded iterator
style. BPF programs can use these kfuncs or through bpf_for_each macro to
iterate all processes in the system.

[...cut...]

Few high level thoughts. I think it would be good to follow
SEC("iter/task") naming and approach. Open-coded iterators in many
ways are in-kernel counterpart to iterator programs, so keeping them
close enough within reason is useful for knowledge transfer.

SEC("iter/task") allows to:
a) iterate all threads in the system
b) iterate all threads for a given TGID
c) it also allows to "iterate" a single thread or process, but that's
a bit less relevant for in-kernel iterator, but we can still support
them, why not?

I'm not sure if it supports iterating all processes (as in group
leaders of each task group) in the system, but if it's possible I
think we should support it at least for open-coded iterator, seems
like a very useful functionality.

So to that end, let's design a small set of input arguments for
bpf_iter_process_new() that would allow to specify this as flags +
either (optional) struct task_struct * pointer to represent
task/process or PID/TGID.


Another concern from Alexei was the readability of the API of open-coded
in BPF Program[1].

bpf_for_each(task, curr) is straightforward. Users can easily understand
that this API does the same thing as 'for_each_process' in kernel.

In general, users might have no idea about for_each_process macro in
the kernel, so I don't find this particular argument very convincing.

We can add a separate set of iterator kfuncs for every useful
combination of conditions, of course, but it's a double-edged sword.
Needing to use a different iterator just to specify a different
direction of cgroup iteration (from the example you referred in [1])
also means that it's now harder to write some generic function that
needs to do something for all cgroups matching some criteria where the
order might be coming as an argument.

Similarly for task iterators. It's not hard to imagine some processing
that can be equivalently done per thread or per process in the system,
or on each thread of the process, depending on some conditions or
external configuration. Having to do three different
bpf_for_each(task_xxx, task, ...) for this seems suboptimal. If the
nature of the thing that is iterated over is the same, and it's just a
different set of filters to specify which subset of those items should
be iterated, I think it's better to try to stick to the same iterator
with few simple arguments. IMO, of course, there is no objectively
best approach.


However, if we keep the approach of SEC("iter/task")

enum ITER_ITEM {
ITER_TASK,
ITER_THREAD,
}

__bpf_kfunc int bpf_iter_task_new(struct bpf_iter_process *it, struct
task_struct *group_task, enum ITER_ITEM type)

the API have to chang:


bpf_for_each(task, curr, NULL, ITERATE_TASK) // iterate all process in
the system
bpf_for_each(task, curr, group_leader, ITERATE_THREAD) // iterate all
thread of group_leader
bpf_for_each(task, curr, NULL, ITERATE_THREAD) //iterate all threads of
all the process in the system

Useres may guess what are this API actually doing....

I'd expect users to consult documentation before trying to use an
unfamiliar cutting-edge functionality. So let's try to keep
documentation clear and up to the point. Extra flag argument doesn't
seem to be a big deal.

Thanks for your suggestion!

Before we begin working on the next version, I have outlined a detailed API design here:

1.task_iter

It will be used to iterate process/threads like SEC("iter/task"). Here we should better to follow the naming and approach SEC("iter/task"):

enum {
ITERATE_PROCESS,
ITERATE_THREAD,
}

__bpf_kfunc int bpf_iter_task_new(struct bpf_iter_task *it, struct task_struct *task, int flag);

If we want to iterate all processes in the system, the iteration will start from the *task* which is passed from user.(since process in the system are connected through a linked list)

Additionally, the *task* can allow users to specify iterating all threads within a task group.

SEC("xxx")
int xxxx(void *ctx)
{
struct task_struct *pos;
struct task_struct *cur_task = bpf_get_current_task_btf();

bpf_rcu_read_lock();

// iterating all process in the system start from cur_task
bpf_for_each(task, pos, cur_task, ITERATE_PROCESS) {

}

// iterate all thread belongs to cur_task group.
bpf_for_each(task, pos, cur_task, ITERATE_THREAD) {

}

bpf_rcu_read_unlock();
return 0;
}

Iterating all thread of each process is great(ITERATE_ALL). But maybe let's break it down step by step and implement ITERATE_PROCESS/ITERATE_THREAD first? (I'm little worried about the cpu overhead of ITERATE_ALL, since we are doing a heavy job in BPF Prog)

I wanted to reuse BPF_TASK_ITER_ALL/BPF_TASK_ITER_TID/BPF_TASK_ITER_TGID insted of new enums like ITERATE_PROCESS/ITERATE_THREAD. But it seems necessary. In BPF Prog, we usually operate task_struct directly instead of pid/tgid. It's a little weird to use BPF_TASK_ITER_TID/BPF_TASK_ITER_TGID here:

bpf_for_each(task, pos, cur_task, BPF_TASK_ITER_TID) {
}

On the other hand, BPF_TASK_ITER_ALL/BPF_TASK_ITER_TID/BPF_TASK_ITER_TGID are inner flags that are hidden from the users.
Exposing ITERATE_PROCESS/ITERATE_THREAD will not cause confusion to user.


2. css_iter.

css_iter will be used to:
(1) iterating subsystem, like for_each_mem_cgroup_tree/cpuset_for_each_descendant_pre in kernel.
(2) iterating cgroup. (patch-6's selfetest has a basic example)

css(cgroup_subsys_state) is more fundamental than struct cgroup. I think we'd better operating css rather than cgroup, since it's can be hard for cgroup_iter to achive (2). So here we keep the name of "css_iter", BPF_CGROUP_ITER_DESCENDANTS_PRE/BPF_CGROUP_ITER_DESCENDANTS_POST/BPF_CGROUP_ITER_ANCESTORS_UP can be reused.


__bpf_kfunc int bpf_iter_css_new(struct bpf_iter_css *it,
struct cgroup_subsys_state *root, unsigned int flag)

bpf_for_each(css, root, BPF_CGROUP_ITER_DESCENDANTS_PRE)

Thanks.