Re: [RFC PATCH 0/5] mm: Select victim memcg using BPF_OOM_POLICY

From: Roman Gushchin
Date: Fri Jul 28 2023 - 00:30:13 EST


On Thu, Jul 27, 2023 at 10:15:16AM +0200, Michal Hocko wrote:
> On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
> > This patchset tries to add a new bpf prog type and use it to select
> > a victim memcg when global OOM is invoked. The mainly motivation is
> > the need to customizable OOM victim selection functionality so that
> > we can protect more important app from OOM killer.
>
> This is rather modest to give an idea how the whole thing is supposed to
> work. I have looked through patches very quickly but there is no overall
> design described anywhere either.
>
> Please could you give us a high level design description and reasoning
> why certain decisions have been made? e.g. why is this limited to the
> global oom sitation, why is the BPF program forced to operate on memcgs
> as entities etc...
> Also it would be very helpful to call out limitations of the BPF
> program, if there are any.

One thing I realized recently: we don't have to make a victim selection
during the OOM, we [almost always] can do it in advance.

Kernel OOM's must guarantee the forward progress under heavy memory pressure
and it creates a lot of limitations on what can and what can't be done in
these circumstances.

But in practice most policies except maybe those which aim to catch very fast
memory spikes rely on things which are fairly static: a logical importance of
several workloads in comparison to some other workloads, "age", memory footprint
etc.

So I wonder if the right path is to create a kernel interface which allows
to define a OOM victim (maybe several victims, also depending on if it's
a global or a memcg oom) and update it periodically from an userspace.
In fact, the second part is already implemented by tools like oomd, systemd-oomd etc.
Someone might say that the first part is also implemented by the oom_score
interface, but I don't think it's an example of a convenient interface.
It's also not a memcg-level interface.

Just some thoughts.

Thanks!