Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

From: David Rientjes
Date: Fri Oct 13 2017 - 17:31:43 EST


On Fri, 13 Oct 2017, Roman Gushchin wrote:

> > Think about it in a different way: we currently compare per-process usage
> > and userspace has /proc/pid/oom_score_adj to adjust that usage depending
> > on priorities of that process and still oom kill if there's a memory leak.
> > Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer
> > after all. We don't need a strict memory.oom_priority that outranks all
> > other sibling cgroups regardless of usage. We need a memory.oom_score_adj
> > to adjust the per-cgroup usage. The decisionmaking in your earlier
> > example would be under the control of C/memory.oom_score_adj and
> > D/memory.oom_score_adj. Problem solved.
> >
> > It also solves the problem of userspace being able to influence oom victim
> > selection so now they can protect important cgroups just like we can
> > protect important processes today.
> >
> > And since this would be hierarchical usage, you can trivially infer root
> > mem cgroup usage by subtraction of top-level mem cgroup usage.
> >
> > This is a powerful solution to the problem and gives userspace the control
> > they need so that it can work in all usecases, not a subset of usecases.
>
> You're right that per-cgroup oom_score_adj may resolve the issue with
> too strict semantics of oom_priorities. But I believe nobody likes
> the existing per-process oom_score_adj interface, and there are reasons behind.

The heuristic in place before I rewrote the oom killer used
/proc/pid/oom_adj, which acted as a bitshift on mm->total_vm and was a
much more difficult interface to use, as I'm sure you can imagine. People
ended up only using it to polarize selection: -17 to oom disable a
process, -16 to bias against selecting it, or 15 to prefer it. Nobody used
anything in between. I worked with openssh, udev, kde, and chromium to
get a consensus on the oom_score_adj semantics. People do use it to
protect against memory leaks and to prevent oom killing important
processes when something else can be sacrificed, unless there's a leak.
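
To make the difference concrete, here is a rough Python model of the two
adjustments. It is only a sketch and not the kernel code: the function
names are made up for illustration, the old heuristic applied further
tweaks beyond the shift, and the current one is modelled as usage plus
oom_score_adj thousandths of total memory.

```python
OOM_DISABLE = -17          # old /proc/pid/oom_adj value that disables killing
OOM_SCORE_ADJ_MIN = -1000  # /proc/pid/oom_score_adj value that disables killing


def old_badness(total_vm_pages: int, oom_adj: int) -> int:
    """Model of oom_adj: a bitshift applied to the base score."""
    if oom_adj == OOM_DISABLE:
        return 0
    if oom_adj > 0:
        return total_vm_pages << oom_adj
    return total_vm_pages >> -oom_adj


def new_badness(usage_pages: int, oom_score_adj: int, total_pages: int) -> int:
    """Model of oom_score_adj: a linear bias in units of 1/1000 of memory."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return 0
    points = usage_pages + oom_score_adj * (total_pages // 1000)
    # An eligible task never scores 0, so it can still be chosen last.
    return max(points, 1)
```

With 1,000,000 pages of memory, a process at -500 using 300,000 pages
scores below a process at 0 using 100,000 pages. Once the first one
leaks up to 700,000 pages it scores 200,000 and becomes the preferred
victim, which is the "protect it unless there's a leak" behavior. The
shift in the old model has no comparable middle ground: -16 drives
almost any score to zero and 15 multiplies it by 32768.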

> Especially in case of memcg-OOM, getting the idea how exactly oom_score_adj
> will work is not trivial.

I suggest defining it in the terms used for previous iterations of the
patchset: do hierarchical scoring so that each level of the hierarchy has
usage information for each subtree. With this method, you can get root
mem cgroup usage with complete fairness by subtracting top-level mem
cgroup usage from total usage. When comparing usage at each level of the
hierarchy, you can propagate the eligibility of processes in that subtree
much like you do today. I agree with your change to make the oom killer a
no-op if selection races with the actual killing rather than falling back
to the old heuristic. I'm happy to help add a Tested-by once we settle
the other issues with that change.
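
As a rough illustration of the subtraction, here is a small Python
model. The Cgroup class, its field names, and the total_system_pages
argument (standing in for whatever the global vmstats report) are
assumptions made for this sketch, not anything from the patchset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cgroup:
    name: str
    own_pages: int = 0  # pages charged directly to this cgroup
    children: List["Cgroup"] = field(default_factory=list)


def hierarchical_usage(cg: Cgroup) -> int:
    """Usage of a cgroup plus everything charged anywhere in its subtree."""
    return cg.own_pages + sum(hierarchical_usage(c) for c in cg.children)


def root_usage(total_system_pages: int, top_level: List[Cgroup]) -> int:
    """Infer root mem cgroup usage without accounting for it directly."""
    return total_system_pages - sum(hierarchical_usage(c) for c in top_level)
```

For example, with 1000 pages in use system-wide, a top-level cgroup A
whose subtree holds 600 pages, and a top-level cgroup B holding 150
pages, the root mem cgroup is inferred to hold the remaining 250 pages.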

At each level, I would state that memory.oom_score_adj has the exact same
semantics as /proc/pid/oom_score_adj. In this case, it would simply be
defined as a proportion of the parent's limit. If the hierarchy is
iterated starting at the root mem cgroup for system ooms and at the root
of the oom memcg for memcg ooms, this should lead to the exact same oom
killing behavior, which is desired.
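
A minimal sketch of what I have in mind, again as a Python model rather
than kernel code. The Memcg class, score() and select_victim() are
hypothetical names; the sketch assumes usage is already hierarchical,
treats -1000 as "never select" like the per-process interface, and does
not model the root mem cgroup's own processes competing with top-level
cgroups.

```python
from dataclasses import dataclass, field
from typing import List

OOM_SCORE_ADJ_MIN = -1000  # assumed to mean "never select", as for processes


@dataclass
class Memcg:
    name: str
    usage: int              # hierarchical usage, in pages
    limit: int              # limit, in pages
    oom_score_adj: int = 0  # hypothetical memory.oom_score_adj, -1000..1000
    children: List["Memcg"] = field(default_factory=list)


def score(cg: Memcg, parent_limit: int) -> int:
    """Usage biased by oom_score_adj thousandths of the parent's limit."""
    if cg.oom_score_adj == OOM_SCORE_ADJ_MIN:
        return 0
    return max(cg.usage + cg.oom_score_adj * (parent_limit // 1000), 1)


def select_victim(oom_root: Memcg) -> Memcg:
    """Walk down from the oom root, taking the highest score at each level.

    Pass the root mem cgroup for a system oom, or the memcg that hit its
    limit for a memcg oom. If no child is eligible at some level, the walk
    stops there; that fallback is an arbitrary choice for this sketch.
    """
    cg = oom_root
    while cg.children:
        parent = cg
        eligible = [c for c in parent.children
                    if c.oom_score_adj != OOM_SCORE_ADJ_MIN]
        if not eligible:
            break
        cg = max(eligible, key=lambda c: score(c, parent.limit))
    return cg
```

With a parent limit of 1000 pages, C using 500 pages at -400 scores 100
and D using 300 pages at 0 scores 300, so D is chosen. If C leaks up to
800 pages it scores 400 and is chosen instead. Splitting D into two
children of 150 pages each leaves D's hierarchical usage, and therefore
the comparison against C, unchanged.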

This solution would address the three concerns that I had: it allows the
root mem cgroup to be compared fairly with leaf mem cgroups (with the
bonus of not requiring root mem cgroup accounting thanks to your heuristic
using global vmstats), it allows userspace to influence the
decision-making so that users can protect cgroups that use 50% of memory
because they are important, and it completely prevents users from
changing victim selection simply by creating child mem cgroups.

This would be a very powerful patchset.