Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

From: David Rientjes
Date: Tue Oct 10 2017 - 17:13:11 EST


On Tue, 10 Oct 2017, Roman Gushchin wrote:

> > This seems to unfairly bias the root mem cgroup depending on process size.
> > It isn't treated fairly as a leaf mem cgroup if they are being compared
> > based on different criteria: the root mem cgroup as (mostly) the largest
> > rss of a single process vs leaf mem cgroups as all anon, unevictable, and
> > unreclaimable slab pages charged to it by all processes.
> >
> > I imagine a configuration where the root mem cgroup has 100 processes
> > attached each with rss of 80MB, compared to a leaf cgroup with 100
> > processes of 1MB rss each. How does this logic prevent repeatedly oom
> > killing the processes of 1MB rss?
> >
> > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't
> > quite fair; it can simply hide large processes from being selected. Users
> > who configure cgroups in a unified hierarchy for other resource
> > constraints are penalized for this choice even though the mem cgroup with
> > 100 processes of 1MB rss each may not be limited itself.
> >
> > I think for this comparison to be fair, it requires accounting for the
> > root mem cgroup itself or for a different accounting methodology for leaf
> > memory cgroups.
>
> This is basically a workaround, because we don't have the necessary stats for
> the root memory cgroup. If we start gathering them at some point, we can change
> this and treat the root memcg exactly like other leaf cgroups.
>

I understand why it currently cannot be an apples vs apples comparison
unless, as I suggest in the last paragraph, the same accounting is done for
the root mem cgroup, which is intuitive if it is to be considered on the
same basis as leaf mem cgroups.

I understand that, for the design to work, leaf mem cgroups and the root
mem cgroup must be compared if processes can be attached to the root mem
cgroup. My point is that the comparison is currently completely unfair, as
I've stated: you can have 10000 processes attached to the root mem cgroup
with rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss
each, and the oom killer is going to target the leaf mem cgroup as a
result of this apples vs oranges comparison.

In case it's not clear, the 10000 processes of 80MB rss each are the most
likely contributors to a system-wide oom kill. Unfortunately, the
heuristic introduced by this patchset is broken wrt a fair comparison of
the root mem cgroup usage.
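
To make the apples vs oranges comparison concrete, here is a trivial
userspace sketch (my own illustration, not code from the patchset; the
numbers and helpers are made up) of the two scoring rules being compared:
the root mem cgroup scored by its single largest process rss, a leaf mem
cgroup scored by the total memory charged to it:

#include <stdio.h>

/* Made-up numbers, in MB, matching the example above. */
#define ROOT_NR_TASKS	10000
#define ROOT_TASK_RSS	80
#define LEAF_NR_TASKS	100
#define LEAF_TASK_RSS	1

/*
 * Root memcg "score" as proposed: the largest rss of a single attached
 * process (all root tasks are identical here, so it is one task's rss).
 */
static long root_score(void)
{
	return ROOT_TASK_RSS;
}

/* Leaf memcg score: the total memory charged to the cgroup. */
static long leaf_score(void)
{
	return (long)LEAF_NR_TASKS * LEAF_TASK_RSS;
}

int main(void)
{
	long root = root_score();	/* 80MB */
	long leaf = leaf_score();	/* 100MB */
	long root_actual = (long)ROOT_NR_TASKS * ROOT_TASK_RSS;

	printf("root score %ldMB vs leaf score %ldMB -> victim: %s\n",
	       root, leaf, leaf > root ? "leaf memcg" : "root memcg");
	printf("actual aggregate root usage: %ldMB\n", root_actual);
	return 0;
}

The leaf memcg "wins" the comparison with 100MB against 80MB, even though
the root mem cgroup's aggregate usage is on the order of 800GB.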

> Or, if someone comes up with an idea for a better approximation, it can be
> implemented as a separate enhancement on top of the initial implementation.
> This is more than welcome.
>

We don't need a better approximation, we need a fair comparison. The
heuristic that this patchset is implementing is based on the usage of
individual mem cgroups. For the root mem cgroup to be considered
eligible, we need to understand its usage. That usage is _not_ what is
implemented by this patchset, which is the largest rss of a single
attached process. This, in fact, is not an "approximation" at all. In
the example of 10000 processes attached with 80MB rss each, the usage of
the root mem cgroup is _not_ 80MB.

I'll restate that oom killing a process is a last resort for the kernel,
but it also must be able to make a smart decision. Targeting dozens of
1MB processes instead of 80MB processes because of a shortcoming in this
implementation is not the appropriate selection; it's the opposite of the
correct selection.

> > I'll reiterate what I did on the last version of the patchset: considering
> > only leaf memory cgroups easily allows users to defeat this heuristic and
> > bias against all of their memory usage up to the largest process size
> > amongst the set of processes attached. If the user creates N child mem
> > cgroups for their N processes and attaches one process to each child, the
> > _only_ thing this achieved is to defeat your heuristic and prefer other
> > leaf cgroups simply because those other leaf cgroups did not do this.
> >
> > Effectively:
> >
> > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> >
> > will radically shift the heuristic from a score of all anonymous +
> > unevictable memory for all processes to a score of the largest anonymous +
> > unevictable memory for a single process. There is no downside or
> > ramification for the end user in doing this. When comparing cgroups based
> > on usage, it only makes sense to compare the hierarchical usage of that
> > cgroup so that attaching processes to descendants or splitting the
> > implementation of a process into several smaller individual processes does
> > not allow this heuristic to be defeated.
>
> To all that has been said previously I can only add that cgroup v2 allows
> limiting the number of cgroups in the sub-tree:
> 1a926e0bbab8 ("cgroup: implement hierarchy limits").
>

So the solution to

for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done

evading all oom kills for your mem cgroup is to limit the number of
cgroups that can be created by the user? With a unified cgroup hierarchy,
that doesn't work well if I want to actually constrain these individual
processes to different resource limits such as cpu usage. In fact, the
user may not even know it is effectively evading the oom killer entirely
by constraining the cpu of individual processes, because the evasion is a
side-effect of this heuristic.
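
To restate what I mean by comparing hierarchical usage, here is a small
userspace sketch (made-up types, not kernel code): if a cgroup's score is
its own charge plus the charge of its entire subtree, the mkdir-per-process
trick above no longer changes the number being compared:

#include <stdio.h>

/* Toy cgroup node: own charge in MB plus an array of children. */
struct cg {
	long own_mb;
	struct cg *children;
	int nr_children;
};

/* Hierarchical usage: own charge plus everything charged below it. */
static long hierarchical_usage(const struct cg *cg)
{
	long total = cg->own_mb;
	int i;

	for (i = 0; i < cg->nr_children; i++)
		total += hierarchical_usage(&cg->children[i]);
	return total;
}

int main(void)
{
	struct cg kids[100];
	struct cg flat  = { .own_mb = 100 };
	struct cg split = { .own_mb = 0, .children = kids, .nr_children = 100 };
	int i;

	/* The same 100MB workload split across 100 child cgroups of 1MB each. */
	for (i = 0; i < 100; i++)
		kids[i] = (struct cg){ .own_mb = 1 };

	printf("flat usage:  %ldMB\n", hierarchical_usage(&flat));
	printf("split usage: %ldMB\n", hierarchical_usage(&split));
	return 0;
}

Both print 100MB, so attaching processes to descendants or splitting a
process into smaller processes would not defeat the heuristic.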


You chose not to respond to my reiteration that userspace has absolutely
no control over victim selection with the new heuristic without setting
all processes to be oom disabled via /proc/pid/oom_score_adj. If I have a
very important job running on a system that is intended to use 80% of
memory, I need to be able to specify, based on user goals, that it should
not be oom killed. Setting all processes in the important mem cgroup to be
oom disabled, so that it is not oom killed unless absolutely necessary in
a system oom condition, is not a robust solution: (1) the mem cgroup
livelocks if it reaches its own mem cgroup limit, and (2) the system
panic()'s if these preferred mem cgroups are the only consumers left on
the system. With overcommit, both of these possibilities exist in the
wild, and the problem is only a result of the implementation detail of
this patchset.
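
For reference, "oom disabled" here means writing OOM_SCORE_ADJ_MIN (-1000)
to /proc/pid/oom_score_adj for every process in the preferred mem cgroup,
i.e. something like the following sketch (illustrative only, error
handling mostly elided):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Mark a single pid as oom disabled; doing this for every process in a
 * "preferred" memcg is the fragile workaround described above. */
static int oom_disable(pid_t pid)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", -1000);	/* OOM_SCORE_ADJ_MIN */
	return fclose(f);
}

int main(int argc, char **argv)
{
	int i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <pid>...\n", argv[0]);
		return 1;
	}
	for (i = 1; i < argc; i++)
		oom_disable(atoi(argv[i]));
	return 0;
}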

For these reasons: the unfair comparison of root mem cgroup usage, which
biases oom kills away from that mem cgroup in system oom conditions; the
ability of users to completely evade the oom killer by attaching all
processes to child cgroups, either purposefully or inadvertently; and the
inability of userspace to effectively control oom victim selection:

Nacked-by: David Rientjes <rientjes@xxxxxxxxxx>

> > This is racy because mem_cgroup_select_oom_victim() found an eligible
> > oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible
> > process but mem_cgroup_scan_task(oc->chosen_memcg) did not. It means if a
> > process cannot be killed because of oom_unkillable_task(), the only
> > eligible processes moved or exited, or the /proc/pid/oom_score_adj of the
> > eligible processes changed, we end up falling back to the complete
> > tasklist scan. It would be better for oom_evaluate_memcg() to consider
> > oom_unkillable_task() and also retry in the case where
> > oom_kill_memcg_victim() returns NULL.
>
> I agree with you here. The fallback to the existing mechanism is implemented
> to be safe, especially in the case of a global OOM. When we gain more
> confidence in the cgroup-aware OOM killer's reliability, we can change this
> behavior. Personally, I would prefer to get rid of looking at all tasks just
> to find a pre-existing OOM victim, but it might be quite tricky to implement.
>

I'm not sure what this has to do with confidence in this patchset's
reliability? The race obviously exists: mem_cgroup_select_oom_victim()
found an eligible process in oc->chosen_memcg, but that process later
became ineligible because of oom_unkillable_task(), moved, or exited.
It's a race. Users who opt in to this new heuristic should not have to
worry that a process exiting results in killing a completely unexpected
process from an unexpected memcg, when it should be possible to retry and
select the correct victim.
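
The retry I'm asking for is nothing exotic: when the kill of the chosen
memcg's victim fails because the victim raced away, select again rather
than punting to the full tasklist scan. A standalone sketch of that
control flow (the function names only echo the patchset's; the bodies are
stand-ins that simulate the race once):

#include <stdio.h>
#include <stdbool.h>

static int attempt;

static bool mem_cgroup_select_oom_victim(void)
{
	attempt++;
	return true;		/* pretend an eligible memcg was found */
}

static bool oom_kill_memcg_victim(void)
{
	/* simulate the race: the chosen victim exits on the first attempt */
	return attempt > 1;
}

int main(void)
{
	int retries = 3;

	while (retries--) {
		if (!mem_cgroup_select_oom_victim())
			break;	/* nothing eligible: fall back to tasklist scan */
		if (oom_kill_memcg_victim()) {
			printf("killed victim on attempt %d\n", attempt);
			return 0;
		}
		/* victim raced away (moved/exited/adjusted): reselect */
	}
	printf("fell back to tasklist scan\n");
	return 0;
}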

It's much better to document and state to the user what they are opting in
to, clearly define how a victim is chosen with the new heuristic, and then
implement that so it works correctly.