Re: [v8 0/4] cgroup-aware OOM killer

From: Roman Gushchin
Date: Tue Sep 26 2017 - 08:13:47 EST


On Tue, Sep 26, 2017 at 01:21:34PM +0200, Michal Hocko wrote:
> On Tue 26-09-17 11:59:25, Roman Gushchin wrote:
> > On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote:
> > > On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
> > > [...]
> > > > I'm not against this model, as I've said before. It feels logical,
> > > > and will work fine in most cases.
> > > >
> > > > In this case we can drop any mount/boot options, because it preserves
> > > > the existing behavior in the default configuration. A big advantage.
> > >
> > > I am not sure about this. We still need an opt-in, regardless, because
> > > selecting the largest process from the largest memcg != selecting the
> > > largest task (just consider the example of a memcg with many small
> > > processes).
> >
> > As I understand Johannes, he suggested comparing individual processes with
> > group_oom memcgs. In other words, always selecting the killable entity with
> > the biggest memory footprint.
> >
> > This is slightly different from my v8 approach, where I treat leaf memcgs
> > as indivisible memory consumers independently of the group_oom setting, so
> > by default I select the biggest task in the biggest memcg.
>
> My reading is that he is actually proposing the same thing I've been
> mentioning. Simply select the biggest killable entity (leaf memcg or
> group_oom hierarchy) and either kill the largest task in that entity
> (for !group_oom) or the whole memcg/hierarchy otherwise.

He wrote the following:
"So I'm leaning toward the second model: compare all oomgroups and
standalone tasks in the system with each other, independent of the
failed hierarchical control structure. Then kill the biggest of them."
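
For illustration, that model could look roughly like this (a toy sketch with
made-up types and names, not code from the patchset): every group_oom memcg
and every standalone task becomes a peer entity, and the single biggest one
is picked.

    #include <stdbool.h>
    #include <stddef.h>

    struct oom_entity {
            bool oom_group;         /* a whole memcg vs. a single task */
            unsigned long pages;    /* memory footprint used as the score */
    };

    /* Return the biggest killable entity, or NULL if there is none. */
    static struct oom_entity *select_victim(struct oom_entity *e, size_t n)
    {
            struct oom_entity *victim = NULL;
            size_t i;

            for (i = 0; i < n; i++)
                    if (!victim || e[i].pages > victim->pages)
                            victim = &e[i];

            return victim;  /* kill all of its tasks if oom_group is set */
    }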

>
> > While the approach suggested by Johannes looks clear and reasonable,
> > I'm slightly concerned about possible implementation issues,
> > which I've described below:
> >
> > >
> > > > The only thing I'm slightly concerned about is that, due to the way we
> > > > calculate the memory footprint of tasks and memory cgroups, we will have
> > > > a number of weird edge cases. For instance, putting a single process into
> > > > a group_oom memcg will alter its oom_score significantly and result in
> > > > significantly different chances of being killed. An obvious example is a
> > > > task with oom_score_adj set to any non-extreme value (other than 0 and
> > > > -1000), but it can also happen in the case of a constrained allocation.
> > >
> > > I am not sure I understand. Are you talking about comparing the root
> > > memcg to other memcgs?
> >
> > Not only that; the root memcg in this case will be another complication. We
> > could also use the same trick for all memcgs (define a memcg's oom_score as
> > the maximum oom_score of its tasks); that would turn group_oom into a pure
> > container-cleanup solution, without changing the victim selection algorithm.
>
> I fail to see the problem, to be honest. Simply evaluate the memcg_score
> you have so far, with one minor detail: you only check memcgs which have
> tasks (rather than checking for leaf nodes) or which are group_oom. An
> intermediate memcg will get the cumulative size of its whole subhierarchy,
> and then you know you can skip its subtree, because no subtree can be larger.
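
For concreteness, I read this as roughly the following traversal (again a toy
sketch, all types and helpers are mine, not the v8 code):

    #include <stdbool.h>
    #include <stddef.h>

    struct memcg {
            bool oom_group;
            bool has_tasks;
            unsigned long pages;            /* usage of this memcg's own tasks */
            unsigned long subtree_pages;    /* cumulative usage incl. children */
            struct memcg **children;
            size_t nr_children;
    };

    /* A group_oom memcg is scored as a whole subtree. */
    static unsigned long memcg_score(const struct memcg *mc)
    {
            return mc->oom_group ? mc->subtree_pages : mc->pages;
    }

    static void scan_subtree(struct memcg *mc, struct memcg **best)
    {
            size_t i;

            if (mc->oom_group || mc->has_tasks) {
                    if (!*best || memcg_score(mc) > memcg_score(*best))
                            *best = mc;
            }

            /* The cumulative score already covers the subtree; skip it. */
            if (mc->oom_group)
                    return;

            for (i = 0; i < mc->nr_children; i++)
                    scan_subtree(mc->children[i], best);
    }
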
>
> > But, again, I'm not against the approach suggested by Johannes. I think
> > that, overall, it has the best possible semantics, if we don't take some
> > implementation details into account.
>
> I do not see those implementation-detail issues, and let me repeat: do not
> develop semantics based on implementation details.

There are no problems with the approach you're describing: select the biggest
leaf or group_oom memcg, then kill the biggest task in it or all of its tasks,
depending on group_oom. It's comparing tasks with memcgs (what Johannes is
suggesting) that may have some issues.
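
To make the oom_score_adj example above concrete: upstream oom_badness()
computes roughly rss + swap + page tables + oom_score_adj * totalpages / 1000.
On a 16GB machine (~4M pages), a 100MB task (~25k pages) with oom_score_adj
set to 500 scores about 25k + 2M points when compared as a task; but if the
very same process is the only member of a group_oom memcg scored by its plain
memory footprint, it scores only ~25k. Moving a single process into a memcg
would then change its chances of being killed by roughly two orders of
magnitude.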

Thanks!