Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

From: David Rientjes
Date: Tue Aug 15 2017 - 17:47:22 EST


On Tue, 15 Aug 2017, Roman Gushchin wrote:

> > I'm curious about the decision made in this conditional and how
> > oom_kill_memcg_member() ignores task->signal->oom_score_adj. It means
> > that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it
> > should otherwise be disabled.
> >
> > It's undocumented in the changelog, but I'm questioning whether it's the
> > right decision. Doesn't it make sense to kill all tasks that are not oom
> > disabled, and allow the user to still protect certain processes by their
> > /proc/pid/oom_score_adj setting? Otherwise, there's no way to do that
> > protection without a sibling memcg and its own reservation of memory. I'm
> > thinking about a process that governs jobs inside the memcg and if there
> > is an oom kill, it wants to do logging and any cleanup necessary before
> > exiting itself. It seems like a powerful combination if coupled with oom
> > notification.
>
> Good question!
> I think, that an ability to override any oom_score_adj value and get all tasks
> killed is more important, than an ability to kill all processes with some
> exceptions.
>

I disagree, because getting all tasks killed is not necessarily something
that only the kernel can do. If all processes are oom disabled, that's a
configuration issue on the sysadmin's part, and the kernel should instead
kill the next largest memory cgroup or a lower priority memory cgroup.
It should not be killing things like sshd that intentionally oom disable
themselves.
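
To make that selection rule concrete, here is a rough userspace model of
it. This is only a sketch, not code from the patch: struct memcg_model,
memcg_has_killable_task() and select_victim_memcg() are names made up
for illustration, and "usage" stands in for whatever size metric the
series actually uses. The only real constant is OOM_SCORE_ADJ_MIN
(-1000).

```c
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)	/* same value as the kernel constant */

/* Hypothetical userspace model of a leaf memcg, not a kernel structure. */
struct memcg_model {
	unsigned long usage;		/* charged memory, arbitrary units */
	const int *oom_score_adj;	/* one entry per attached task */
	size_t nr_tasks;
};

/* A memcg is eligible only if at least one task is not oom disabled. */
int memcg_has_killable_task(const struct memcg_model *m)
{
	for (size_t i = 0; i < m->nr_tasks; i++)
		if (m->oom_score_adj[i] != OOM_SCORE_ADJ_MIN)
			return 1;
	return 0;
}

/* Return the index of the largest eligible memcg, or -1 if there is none. */
int select_victim_memcg(const struct memcg_model *cgs, size_t n)
{
	int victim = -1;

	for (size_t i = 0; i < n; i++) {
		/* Fully oom disabled memcg: move on to the next candidate. */
		if (!memcg_has_killable_task(&cgs[i]))
			continue;
		if (victim < 0 || cgs[i].usage > cgs[victim].usage)
			victim = (int)i;
	}
	return victim;
}
```

With this model, a memcg whose tasks are all at -1000 is never chosen,
no matter how large it is; the next largest memcg that still has a
killable task is picked instead.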

You could argue that having an oom disabled process attached to these
memcgs in the first place is also a configuration issue, but the problem
is that cgroup v2 restricts processes to being attached only at the leaf
cgroups, so there is no competition for memory in this case. I must
assign memory resources to that sshd, or to the "Activity Manager"
described by the cgroup v1 documentation, just to prevent it from being
killed.

I think the latter of what you describe, killing all processes with some
exceptions, is actually quite powerful. I can guarantee that processes
that set themselves to oom disabled are really oom disabled and I don't
need to work around that in the cgroup hierarchy only because of this
caveat. I can also oom disable my Activity Manager that wants to wait on
oom notification and collect the oom kill logs, raise notifications, and
perhaps restart the process that it manages.
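
As a sketch of the "kill all processes with some exceptions" behavior I
am arguing for, consider the following userspace model. It is not the
patch's oom_kill_memcg_member(); struct task_model and
kill_memcg_tasks() are hypothetical names, and setting a flag stands in
for actually sending SIGKILL.

```c
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)	/* same value as the kernel constant */

/* Hypothetical model of a task attached to the victim memcg. */
struct task_model {
	int pid;
	int oom_score_adj;
	int killed;		/* set to 1 where the kernel would SIGKILL */
};

/*
 * Kill every task in the memcg except those that are oom disabled.
 * Returns the number of tasks killed.
 */
size_t kill_memcg_tasks(struct task_model *tasks, size_t n)
{
	size_t nr_killed = 0;

	for (size_t i = 0; i < n; i++) {
		/* Honor /proc/pid/oom_score_adj == -1000 even when killing all. */
		if (tasks[i].oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;
		tasks[i].killed = 1;
		nr_killed++;
	}
	return nr_killed;
}
```

In this model the oom disabled manager survives, and once it receives
the oom notification it can collect logs and restart its jobs.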

> In your example someone still needs to look after the remaining process,
> and kill it after some timeout, if it will not quit by itself, right?
>

No, it can restart the process that was oom killed; or it can be sshd and
I can still ssh into my machine.

> The special treatment of the -1000 value (without oom_kill_all_tasks)
> is required only to not break the existing setups.
>

I think, as a general principle, that allowing an oom disabled process to
be oom killed is incorrect. If you really do want these to be killed,
then either (1) your oom_score_adj is already wrong or (2) you can wait
on oom notification and exit.
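
For reference, this is roughly how a daemon can oom disable itself from
userspace today. It is a minimal sketch: write_oom_score_adj() is a
hypothetical helper, not an existing library call, and the path is
passed in so the same code can be pointed at /proc/self/oom_score_adj or
at a scratch file for testing.

```c
#include <stdio.h>

#define OOM_SCORE_ADJ_MIN (-1000)	/* oom disabled */
#define OOM_SCORE_ADJ_MAX 1000

/*
 * Write adj to the given oom_score_adj file.
 * Returns 0 on success, -1 on a bad value or I/O failure.
 */
int write_oom_score_adj(const char *path, int adj)
{
	FILE *f;
	int ok;

	if (adj < OOM_SCORE_ADJ_MIN || adj > OOM_SCORE_ADJ_MAX)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%d\n", adj) > 0;
	/* The write may only be reported as failed when the file is closed. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A process would call write_oom_score_adj("/proc/self/oom_score_adj",
OOM_SCORE_ADJ_MIN) early in startup. Lowering the value typically needs
CAP_SYS_RESOURCE, so the call can fail for unprivileged processes.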