Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task

From: Chuyi Zhou
Date: Thu Aug 17 2023 - 23:33:10 EST


Hello,
On 2023/8/17 11:22, Alexei Starovoitov wrote:
On Wed, Aug 16, 2023 at 7:51 PM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:

Hello,

On 2023/8/17 10:07, Alexei Starovoitov wrote:
On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
static int oom_evaluate_task(struct task_struct *task, void *arg)
{
	struct oom_control *oc = arg;
@@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
		goto next;

+	/*
+	 * If task is allocating a lot of memory and has been marked to be
+	 * killed first if it triggers an oom, then select it.
+	 */
+	if (oom_task_origin(task)) {
+		points = LONG_MAX;
+		goto select;
+	}
+
+	switch (bpf_oom_evaluate_task(task, oc)) {
+	case BPF_EVAL_ABORT:
+		goto abort; /* abort search process */
+	case BPF_EVAL_NEXT:
+		goto next; /* ignore the task */
+	case BPF_EVAL_SELECT:
+		goto select; /* select the task */
+	default:
+		break; /* No BPF policy */
+	}
+

I think forcing bpf prog to look at every task is going to be limiting
long term.
It's more flexible to invoke the bpf prog from out_of_memory()
and, if it doesn't choose a task, fall back to select_bad_process().
I believe that's what Roman was proposing.
bpf can choose to iterate memcg or it might have some side knowledge
that there are processes that can be set as oc->chosen right away,
so it can skip the iteration.
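
To make the control flow suggested above concrete, here is a minimal
userspace sketch in plain C. It is not kernel code: struct task,
struct oom_ctl, policy_fn, fallback_select() and pick_victim() are
hypothetical stand-ins for task_struct, oom_control, the bpf prog,
select_bad_process() and out_of_memory(), and badness is reduced to a
single number.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for task_struct and oom_control. */
struct task {
	int pid;
	long points;	/* simplified badness score */
};

struct oom_ctl {
	const struct task *tasks;
	size_t nr_tasks;
	const struct task *chosen;
};

/* Stand-in for a BPF policy: may set oc->chosen or leave it NULL. */
typedef void (*policy_fn)(struct oom_ctl *oc);

/* Stand-in for select_bad_process(): scan all tasks, keep the top score. */
static void fallback_select(struct oom_ctl *oc)
{
	for (size_t i = 0; i < oc->nr_tasks; i++) {
		const struct task *t = &oc->tasks[i];

		if (!oc->chosen || t->points > oc->chosen->points)
			oc->chosen = t;
	}
}

/* Stand-in for out_of_memory(): ask the policy first, then fall back. */
static const struct task *pick_victim(struct oom_ctl *oc, policy_fn policy)
{
	assert(oc);
	oc->chosen = NULL;
	if (policy)
		policy(oc);
	if (!oc->chosen)
		fallback_select(oc);
	return oc->chosen;
}

/* Example policy with "side knowledge": pick pid 42 if it exists. */
static void policy_pick_42(struct oom_ctl *oc)
{
	for (size_t i = 0; i < oc->nr_tasks; i++)
		if (oc->tasks[i].pid == 42)
			oc->chosen = &oc->tasks[i];
}

/* Example policy that declines to choose, forcing the fallback scan. */
static void policy_none(struct oom_ctl *oc)
{
	(void)oc;
}
```

With tasks {pid 1, 10 points}, {pid 42, 5 points}, {pid 7, 30 points},
policy_pick_42 selects pid 42 without caring about scores, while
policy_none (or no policy at all) ends up with pid 7 from the fallback.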

IIUC, we may need some new BPF features if we want to iterate over
tasks/memcgs in BPF, such as:
bpf_for_each_task
bpf_for_each_memcg
bpf_for_each_task_in_memcg
...

It seems we have some work to do on the BPF side first.
Would these iteration features also be useful in BPF scenarios other
than OOM policy?

Yes.
Use open coded iterators though.
Like the example in
https://lore.kernel.org/all/20230810183513.684836-4-davemarchevsky@xxxxxx/

bpf_for_each(task_vma, vma, task, 0) { ... }
will safely iterate vma-s of the task.
Similarly struct css_task_iter can be hidden inside bpf open coded iterator.
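
For readers unfamiliar with the pattern: as I understand it, an open
coded iterator is a new/next/destroy triple plus a for-loop macro that
guarantees destroy runs on every exit path. Below is a userspace mock of
that shape, not the BPF API itself. struct iter_num, iter_num_new(),
iter_num_next(), iter_num_destroy(), for_each_num() and live_iters are
all hypothetical names, and the macro relies on the GCC/Clang cleanup
attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator state over the half-open range [cur, end). */
struct iter_num {
	int cur;
	int end;
	int val;	/* storage for the element handed out by next */
};

static int live_iters;	/* counts iterators created but not yet destroyed */

/* "new": set up the iterator. */
static void iter_num_new(struct iter_num *it, int start, int end)
{
	it->cur = start;
	it->end = end;
	live_iters++;
}

/* "next": return the next element, or NULL when the range is exhausted. */
static int *iter_num_next(struct iter_num *it)
{
	if (it->cur >= it->end)
		return NULL;
	it->val = it->cur++;
	return &it->val;
}

/* "destroy": release the iterator; invoked automatically by cleanup. */
static void iter_num_destroy(struct iter_num *it)
{
	assert(live_iters > 0);
	it->cur = it->end;
	live_iters--;
}

/* The second declarator only exists to call iter_num_new() in the init. */
#define for_each_num(curp, start, end)					\
	for (struct iter_num it_ __attribute__((cleanup(iter_num_destroy))), \
	     *unused_ __attribute__((unused)) =				\
		(iter_num_new(&it_, (start), (end)), (struct iter_num *)NULL); \
	     ((curp) = iter_num_next(&it_)); )

/* Sum [start, end), stopping early once the running sum exceeds limit. */
static int sum_until(int start, int end, int limit)
{
	int *v, sum = 0;

	for_each_num(v, start, end) {
		sum += *v;
		if (sum > limit)
			break;	/* iter_num_destroy() still runs here */
	}
	return sum;
}
```

The point of the construction is that the loop body may break or return
at any time and the iterator is still torn down, which is what lets a
lock- or reference-holding iterator such as css_task_iter be hidden
behind it.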
OK. I think the following APIs would be useful, and I am willing to start with these in a separate bpf-next RFC patchset:

1. bpf_for_each(task). Like for_each_process(p) in the kernel, it iterates over all tasks in the system under rcu_read_lock().

2. bpf_for_each(css_task, task, css). It works like css_task_iter_{start, next, end} and would be used to iterate over the tasks/threads under a css.

3. bpf_for_each(descendant_css, css, root_css, {PRE, POST}). It works like css_next_descendant_{pre, post} and iterates over all descendants of root_css.
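
To illustrate the two traversal orders item 3 would expose, here is a
small userspace sketch of pre-order and post-order "next descendant"
walks over a parent/child/sibling tree. It only mimics the shape of
css_next_descendant_{pre, post}: struct node, next_descendant_pre(),
next_descendant_post() and walk() are hypothetical names, and there is
no locking or RCU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a css in the cgroup hierarchy. */
struct node {
	int id;
	struct node *parent, *first_child, *next_sibling;
};

/* Pre-order: a node is visited before its descendants. NULL pos starts. */
static struct node *next_descendant_pre(struct node *pos, struct node *root)
{
	if (!pos)
		return root;
	if (pos->first_child)
		return pos->first_child;
	/* Climb until a sibling is found or we are back at root. */
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}

static struct node *leftmost_leaf(struct node *n)
{
	while (n->first_child)
		n = n->first_child;
	return n;
}

/* Post-order: a node is visited after all of its descendants. */
static struct node *next_descendant_post(struct node *pos, struct node *root)
{
	if (!pos)
		return leftmost_leaf(root);
	if (pos == root)
		return NULL;
	if (pos->next_sibling)
		return leftmost_leaf(pos->next_sibling);
	return pos->parent;
}

/* Record visited ids into out[]; returns the number of nodes visited. */
static size_t walk(struct node *root, int post, int *out, size_t cap)
{
	struct node *pos = NULL;
	size_t n = 0;

	assert(root && out);
	while (n < cap &&
	       (pos = post ? next_descendant_post(pos, root)
			   : next_descendant_pre(pos, root)))
		out[n++] = pos->id;
	return n;
}
```

For an OOM policy, PRE suits top-down decisions (skip a whole subtree
early), while POST suits bottom-up aggregation where children must be
seen before their parent.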

If you have better ideas or any advice, please let me know.
Thanks.