Re: [RFC PATCH 0/5] mm: Select victim memcg using BPF_OOM_POLICY

From: Chuyi Zhou
Date: Tue Aug 01 2023 - 23:05:32 EST

On 2023/8/1 16:18, Michal Hocko wrote:
On Tue 01-08-23 00:26:20, Chuyi Zhou wrote:

On 2023/7/31 21:23, Michal Hocko wrote:
On Mon 31-07-23 14:00:22, Chuyi Zhou wrote:
Hello, Michal

On 2023/7/28 01:23, Michal Hocko wrote:
This sounds like a very specific oom policy and that is fine. But the
interface shouldn't be bound to any concepts like priorities let alone
be bound to memcg based selection. Ideally the BPF program should get
the oom_control as an input and either get a hook to kill process or if
that is not possible then return an entity to kill (either process or
set of processes).

Here are two interfaces I can think of. I was wondering if you could give me
some feedback.

1. Add a new hook in select_bad_process(). We can attach to it and return a
set of pids or cgroup_ids pre-selected by a user-defined policy, as
suggested by Roman. Then we could use oom_evaluate_task() to find a final
victim among them. It's user-friendly and we can offload the OOM policy to

2. Add a new hook in oom_evaluate_task() and return a score that overrides the
default oom_badness() return value. The simplest way to use this is to protect
certain processes by giving them the minimum score.

Of course if you have a better idea, please let me know.

Hooking into oom_evaluate_task seems the least disruptive to the
existing oom killer implementation. I would start by playing with that
and see whether useful oom policies could be defined this way. I am not
sure what is the best way to communicate user input so that a BPF program
can consume it though. The interface should be generic enough that it
doesn't really pre-define any specific class of policies. Maybe we can
add something completely opaque to each memcg/task? Does BPF
infrastructure allow anything like that already?

“Maybe we can add something completely opaque to each memcg/task?”
Sorry, I don't understand what you mean.

What I meant to say is to add a very non-specific interface that only
a specific BPF program would understand. Mostly an opaque value from the
memcg POV.

I think we probably don't need to expose too much to the user, the following
might be sufficient:

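	noinline int bpf_get_score(struct oom_control *oc,
				   struct task_struct *task);

	static int oom_evaluate_task(struct task_struct *task, void *arg)
	{
		...
		/* Ask the BPF policy first; fall back to the default heuristic. */
		points = bpf_get_score(oc, task);
		if (!check_points_valid(points))
			points = oom_badness(task, oc->totalpages);
		...
	}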
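
The fallback flow proposed above can be tried out in userspace. The following is only a minimal sketch under stated assumptions: struct task, policy_get_score() and default_badness() are hypothetical stand-ins for task_struct, bpf_get_score() and oom_badness(), and the rule that "valid" means non-negative is an assumption, since the thread does not define check_points_valid().

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical sentinel: the policy has no opinion about this task. */
#define SCORE_NO_OPINION LONG_MIN

/* Stand-in for the kernel's struct task_struct; only what the sketch needs. */
struct task {
	int pid;
	long rss_pages;     /* input to the default heuristic */
	long policy_score;  /* what the policy returns, or SCORE_NO_OPINION */
};

/* Stand-in for bpf_get_score(): returns the policy's score for the task. */
static long policy_get_score(const struct task *t)
{
	return t->policy_score;
}

/* Assumed meaning of check_points_valid(): any non-negative score is valid. */
static bool check_points_valid(long points)
{
	return points >= 0;
}

/* Stand-in for oom_badness(): here simply the resident page count. */
static long default_badness(const struct task *t)
{
	return t->rss_pages;
}

/* Mirrors the proposed flow: ask the policy first, else use the default. */
static long evaluate_task(const struct task *t)
{
	long points = policy_get_score(t);

	if (!check_points_valid(points))
		points = default_badness(t);
	return points;
}
```

A task for which the policy returns a valid score is ranked by that score; any other task is ranked exactly as it would be without a policy attached.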

There are several reasons:

1. The implementation of user-defined OOM policy, such as iteration, sorting
and other complex operations, is more suitable to be placed in the userspace
rather than in the bpf program. It is more convenient to implement these
operations in userspace in which the useful information (memory usage of
each task and memcg, memory allocation speed, etc.) can also be captured.
For example, oomd implements multiple policies[1] without kernel-space

I do agree that userspace can handle a lot on its own and provide the
input to the BPF program to make a decision.

2. Userspace apps, such as oomd, can import useful information into bpf
program, e.g., through bpf_map, and update it periodically. For example, we
can do the scoring directly in userspace and maintain a score hash, so that
in the bpf program, we only need to look for the corresponding score of the

Sure, why not. But all that is an implementation detail. We are
currently talking about a proper abstraction and layering that would
allow what you do currently but also much more.

Userspace policy (oomd)
    ------------------> BPF program
                            look up score in
                            ---------------> kernel space
Just some thoughts.
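
The score-hash idea can be illustrated in plain C. This is a sketch only: score_hash, score_update() and score_lookup() are hypothetical names, and the fixed-size table stands in for a BPF hash map that userspace would fill and the BPF program would read. Entries are never deleted here, which keeps the linear probing simple.

```c
#include <stddef.h>

#define SCORE_SLOTS 64  /* arbitrary capacity for the sketch */

struct score_entry {
	int used;
	int pid;
	long score;
};

static struct score_entry score_hash[SCORE_SLOTS];

static size_t slot_of(int pid)
{
	return (size_t)(unsigned int)pid % SCORE_SLOTS;
}

/*
 * Userspace side: insert or overwrite the score for pid.
 * Returns 0 on success, -1 if the table is full.
 */
static int score_update(int pid, long score)
{
	size_t start = slot_of(pid);

	for (size_t i = 0; i < SCORE_SLOTS; i++) {
		struct score_entry *e = &score_hash[(start + i) % SCORE_SLOTS];

		if (!e->used || e->pid == pid) {
			e->used = 1;
			e->pid = pid;
			e->score = score;
			return 0;
		}
	}
	return -1;
}

/*
 * Policy side: return a pointer to the stored score,
 * or NULL if pid has no entry.
 */
static const long *score_lookup(int pid)
{
	size_t start = slot_of(pid);

	for (size_t i = 0; i < SCORE_SLOTS; i++) {
		const struct score_entry *e = &score_hash[(start + i) % SCORE_SLOTS];

		if (!e->used)
			return NULL;
		if (e->pid == pid)
			return &e->score;
	}
	return NULL;
}
```

A NULL lookup is the natural place to fall back to the default scoring, so tasks that userspace has not scored yet are still handled.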

I believe all the above should be possible if the BPF program is hooked at
the oom_evaluate_task layer and allowed to bypass the default logic. The BPF
program can process whatever data it has available. The oom scope iteration
will be implemented already in the kernel so all the BPF program has to
do is to rank processes and/or memcgs if is enabled. Would
that work for your usecase?

Yes, I think the above interface can work well for our usecase.

In our scenario, we want to protect higher-priority applications and select lower-priority ones as victims.

Specifically, we can set a priority for each memcg in userspace. In the BPF program, we can find the memcg to which a given process belongs, and then rank the process according to that memcg's priority.
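
The priority-based ranking could look roughly like the sketch below. It is a userspace illustration under assumptions, not kernel code: struct prio_memcg, struct prio_task, MAX_PRIO and the "score = MAX_PRIO - priority" mapping are all hypothetical, and select_victim() only imitates the existing behaviour of keeping the task with the highest score.

```c
#include <stddef.h>

/* Hypothetical range: priorities run 0..MAX_PRIO, higher = more important. */
#define MAX_PRIO 100

struct prio_memcg {
	int id;
	int priority;  /* set from userspace in the proposed scheme */
};

struct prio_task {
	int pid;
	const struct prio_memcg *memcg;  /* memcg the task belongs to */
};

/* Lower memcg priority gives a higher score, i.e. a more likely victim. */
static long score_task(const struct prio_task *t)
{
	return MAX_PRIO - t->memcg->priority;
}

/* Keep the task with the highest score; ties keep the earlier task. */
static const struct prio_task *select_victim(const struct prio_task *tasks,
					     size_t n)
{
	const struct prio_task *chosen = NULL;
	long chosen_points = -1;

	for (size_t i = 0; i < n; i++) {
		long points = score_task(&tasks[i]);

		if (points > chosen_points) {
			chosen = &tasks[i];
			chosen_points = points;
		}
	}
	return chosen;
}
```

With this mapping, every task in a low-priority memcg outranks every task in a higher-priority one, so the choice among tasks within the same memcg would still need a secondary criterion.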