Re: [RFC PATCH v2 1/5] mm, oom: Introduce bpf_oom_evaluate_task

From: Michal Hocko
Date: Tue Aug 22 2023 - 06:40:00 EST


On Wed 16-08-23 19:07:10, Alexei Starovoitov wrote:
> On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
> > static int oom_evaluate_task(struct task_struct *task, void *arg)
> > {
> > struct oom_control *oc = arg;
> > @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> > if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
> > goto next;
> >
> > + /*
> > + * If task is allocating a lot of memory and has been marked to be
> > + * killed first if it triggers an oom, then select it.
> > + */
> > + if (oom_task_origin(task)) {
> > + points = LONG_MAX;
> > + goto select;
> > + }
> > +
> > + switch (bpf_oom_evaluate_task(task, oc)) {
> > + case BPF_EVAL_ABORT:
> > + goto abort; /* abort search process */
> > + case BPF_EVAL_NEXT:
> > + goto next; /* ignore the task */
> > + case BPF_EVAL_SELECT:
> > + goto select; /* select the task */
> > + default:
> > + break; /* No BPF policy */
> > + }
> > +
>
> I think forcing bpf prog to look at every task is going to be limiting
> long term.
> It's more flexible to invoke bpf prog from out_of_memory()
> and if it doesn't choose a task then fallback to select_bad_process().
> I believe that's what Roman was proposing.
> bpf can choose to iterate memcg or it might have some side knowledge
> that there are processes that can be set as oc->chosen right away,
> so it can skip the iteration.

This is certainly possible but I am worried this will lead to a lot of
duplication. There are common tasks that all/most oom victim selection
implementations should do. First of all they should make sure that the
victim is belonging to the oom domain. Arguably it should be also aware
of ongoing oom victim tear down to prevent from overkilling. Proper oom
victim reference counting handling. Most people are not even aware of
those things. Do we really want all those to be re-invented - most
likely incorrectly?

Advantage of reusing oom_evaluate_task is that all that can be avoided.
Iterating over tasks with a pre-defined oom-victim sure sounds like
unnecessary and wasted CPU cycles but if you want to prevent
over-killing this is still necessary. As the oom killer can be invoked
really rapidly (before the victim has a chance to die) I believe this is
a very useful feature.
--
Michal Hocko
SUSE Labs