Re: [PATCH v13 0/7] cgroup-aware OOM killer

From: David Rientjes
Date: Fri Jan 12 2018 - 17:03:19 EST


On Thu, 11 Jan 2018, Roman Gushchin wrote:

> Summarizing all this, following the hierarchy is good when it reflects
> the "importance" of cgroup's memory for a user, and bad otherwise.
> In generic case with unified hierarchy it's not true, so following
> the hierarchy unconditionally is bad.
>
> Current version of the patchset allows common evaluation of cgroups
> by setting memory.groupoom to true. The only limitation is that it
> also changes the OOM action: all belonging processes will be killed.
> If we really want to preserve an option to evaluate cgroups
> together without forcing "kill all processes" action, we can
> convert memory.groupoom from being boolean to the multi-value knob.
> For instance, "disabled" and "killall". This will allow to add
> a third state ("evaluate", for example) later without breaking the API.
>

No, this isn't how kernel features get introduced. We don't design a new
kernel feature with its own API for a highly specialized usecase and then
claim we'll fix the problems later. Users will work around the
constraints of the new feature, if possible, and then we can end up
breaking them later. Or, we can pollute the mem cgroup v2 filesystem with
even more tunables to cover up for mistakes in earlier designs.

The key point to all three of my objections: extensibility.

Both you and Michal have acknowledged blantently obvious shortcomings in
the design. We do not need to merge an incomplete patchset that forces us
into a corner later so that we need to add more tunables and behaviors.
It's also an incompletely documented patchset: the user has *no* insight
other than reading the kernel code that with this functionality their
/proc/pid/oom_score_adj values are effective when attached to the root
cgroup and ineffective when attached to a non-root cgroup, regardless of
whether the memory controller is enabled or not.

Extensibility when designing a new kernel feature is important. If you
see shortcomings, which I've enumerated and have been acknowledged, it
needs to be fixed.

The main consideration here is two choices:

(1) any logic to determine process(es) to be oom killed belongs only in
userspace, and has no place in the kernel, or

(2) such logic is complete, documented, extensible, and is generally
useful.

This patchset obviously is neither. I believe we can move toward (2), but
it needs to be sanely designed and implemented such that it addresses the
concerns I've enumerated that we are all familiar with.

The solution is to define the oom policy as a trait of the mem cgroup
itself. That needs to be complemented with a per mem cgroup value member
to specify the policy for child mem cgroups. We don't need any new
memory.groupoom, which depends on only this policy decision and is
otherwise a complete no-op, and we don't need any new mount option.

It's quite obvious the system cannot have a single cgroup aware oom
policy. That's when you get crazy inconsistencies and functionally broken
behavior such as influencing oom kill selection based on whether a cgroup
distributed processes over a set of child cgroups, even if they do not
have memory in their memory.subtree_control. It's also where you can give
the ability to user to still prefer to protect important cgroups or bias
against non-important cgroups.

The point is that the functionality needs to be complete and it needs to
be extensible.

I'd propose to introduce a new memory.oom_policy and a new
memory.oom_value. All cgroups have these files, but memory.oom_value is a
no-op for the root mem cgroup, just as you can't limit the memory usage on
the root mem cgroup.

memory.oom_policy defines the policy for selection when that mem cgroup is
oom; system oom conditions refer to the root mem cgroup's
memory.oom_policy.

memory.oom_value defines any parameter that child mem cgroups should be
considered with when effecting that policy.

There's three memory.oom_policy values I can envision at this time, but it
can be extended in the future:

- "adj": cgroup aware just as you have implemented based on how large a
cgroup is. oom_value can be used to define any adjustment made to that
usage so that userspace can protect important cgroups or bias against
non-important cgroups (I'd suggest using that same semantics as
oom_score_adj for consistency, but wouldn't object to using a byte
value).

- "hierarchy": same as "adj" but based on hierarchical usage. oom_value
has the same semantics as "adj". This prevents users from
intentionally or unintentionally affecting oom kill selection by
limiting groups of their processes by the memory controller, or any
other cgroup controller.

- "priority": cgroup aware just as you implemented memory.oom_priority in
the past. oom_value is a strict priority value where usage is not
considered and only the priority of the subtree is compared.

With cgroup v2 sematics of no internal process constraint, this is
extremely straight forward. All of your oom evaluation function can be
reused with a simple comparison based on the policy to score individual
cgroups. In the simplest policy, "priority", this is like a 10 line
function, but extremely useful to userspace.

This allows users to have full power over the decisionmaking in every
subtree wrt oom kill selection and doesn't regress or allow for any
pitfalls of the current patchset. The goal is not to have one single oom
policy for the entire system, but define the policy that makes useful
sense. This is how an extensible feature is designed and does not require
any further pollution of the mem cgroup filesystem.

If you have additional features such as groupoom, you can make
memory.oom_policy comma delimited, just as vmpressure modes are comma
delimited. You would want to use "adj,groupoom". We don't need another
file that is pointless in other policy decisions. We don't need a mount
option to lock the entire system into a single methodology.

Cgroup v2 is a very clean interface and I think it's the responsibility of
every controller to maintain that. We should not fall into a cgroup v1
mentality which became very difficult to make extensible. Let's make a
feature that is generally useful, complete, and empowers the user rather
than push them into a corner with a system wide policy with obvious
downsides.

For these reasons, and the three concerns that I enumerated earlier which
have been acknowledged of obvious shortcoming with this approach:

Nacked-by: David Rientjes <rientjes@xxxxxxxxxx>

I'll be happy to implement the core functionality that allows oom policies
to be written by the user and introduce memory.oom_value, and then rework
the logic defined in your patchset as "adj" by giving the user an optional
way of perferring or biasing that usage in a way that is clearly
documented and extended. Root mem cgroup usage is obviously wrong in this
current patchset since it uses oom_score_adj whereas leaf cgroups do not,
so that will be fixed. But I'll ask that the current patchset is removed
from -mm since it has obvious downsides, pollutes the mem cgroup v2
filesystem, is not extensible, is not documented wrt to oom_score_adj,
allows evasion of the heuristic, and doesn't allow the user to have any
power in the important decision of which of their important processes is
oom killed such that this feature is not useful outside very specialized
usecases.