Re: Re: [RFC PATCH 1/2] mm, oom: Introduce bpf_select_task

From: Roman Gushchin
Date: Tue Aug 15 2023 - 15:53:38 EST


On Thu, Aug 10, 2023 at 12:00:36PM +0800, Abel Wu wrote:
> On 8/9/23 3:53 PM, Michal Hocko wrote:
> > On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
> > > It would be also nice to come up with some practical examples of bpf programs.
> > > What are meaningful scenarios which can be covered with the proposed approach
> > > and are not covered now with oom_score_adj.
> >
> > Agreed here as well. This RFC serves purpose of brainstorming on all of
> > this.
> >
> > There is a fundamental question whether we need BPF for this task in the
> > first place. Are there any huge advantages to export the callback and
> > allow a kernel module to hook into it?
>
> The ancient oom-killer largely depends on memory usage when choosing
> victims, which might not fit the need of modern scenarios. It's common
> nowadays that multiple workloads (tenants) with different 'priorities'
> run together, and the decisions made by the oom-killer don't always
> obey the service level agreements.
>
> The oom_score_adj only adjusts the usage-based decisions, so it
> can hardly be translated into 'priority' semantics. How can we properly
> configure it given that we don't know how much memory the workloads
> will use? It's really hard for a static strategy to deal with dynamic
> provision. IMHO the oom_score_adj is just another demon.
>
> Reworking the oom-killer's internal algorithm or patching some random
> metrics may satisfy the immediate needs, but for the next 10 years? I
> doubt it. So I think we do need the flexibility to bypass the legacy
> usage-based algorithm, through bpf or pre-select interfaces.

I agree in general, but I wouldn't call the existing implementation legacy
or obsolete. It's all about trade-offs. The goal of the existing implementation
is to guarantee forward progress without killing any processes prematurely,
and it does that relatively well.

Userspace oom killers (e.g. oomd) on top of PSI were initially created to
solve the problem of memory thrashing: a system which is barely doing
anything useful, but is not stuck enough for the OOM killer to kick in.
They also turned out to provide much better flexibility. The downside is
that they can't be as reliable as the in-kernel OOM killer.

Bpf or a pre-select interface can in theory glue them together: give the
user the flexibility to choose the OOM victim without compromising on
reliability. A pre-select interface could be preferable if all the logic is
already implemented in userspace, but might be slightly less accurate if some
statistics (e.g. memory usage) are used to determine the victim, because they
can be stale by the time the OOM actually happens. The bpf approach will
require re-implementing the logic, but is potentially more powerful due to
fast access to a lot of kernel data.
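
To make the difference a bit more concrete, here is a rough sketch in plain
userspace C. It is not a bpf program and not the interface proposed in this
RFC. struct candidate, its priority field and both select_*() helpers are
made up for illustration, and usage_score() is only a simplified
approximation of what oom_badness() does with oom_score_adj.

```c
#include <limits.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical per-task snapshot; the layout is illustrative only. */
struct candidate {
	int  pid;
	long pages;         /* rss + swap + page tables, in pages */
	int  oom_score_adj; /* -1000..1000 */
	int  priority;      /* assumed policy input: lower is killed first */
};

/* Simplified usage-based score: the adjustment is scaled by the total
 * amount of memory and added on top of the usage. */
long usage_score(const struct candidate *c, long totalpages)
{
	if (c->oom_score_adj == OOM_SCORE_ADJ_MIN)
		return LONG_MIN; /* never selected */
	return c->pages + (long)c->oom_score_adj * (totalpages / 1000);
}

/* Usage-based policy: the highest score wins. Returns an index or -1. */
int select_by_usage(const struct candidate *t, size_t n, long totalpages)
{
	int best = -1;
	long best_score = LONG_MIN;

	for (size_t i = 0; i < n; i++) {
		long s = usage_score(&t[i], totalpages);

		if (s != LONG_MIN && (best < 0 || s > best_score)) {
			best = (int)i;
			best_score = s;
		}
	}
	return best;
}

/* Priority-first policy: the lowest priority wins, memory usage is only
 * a tie-breaker. Returns an index or -1. */
int select_by_priority(const struct candidate *t, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (t[i].oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;
		if (best < 0 || t[i].priority < t[best].priority ||
		    (t[i].priority == t[best].priority &&
		     t[i].pages > t[best].pages))
			best = (int)i;
	}
	return best;
}
```

With 1000000 total pages, an important task using 600000 pages with
oom_score_adj 0 still outscores a low-priority task using 100000 pages with
oom_score_adj 300 (600000 vs 400000), so the usage-based policy picks the
important one, while the priority-first policy picks the low-priority one
regardless of how much memory each of them happens to use.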

Thanks!