Re: [patch] mm, memcg: add oom killer delay

From: David Rientjes
Date: Wed Jun 05 2013 - 02:40:45 EST


On Tue, 4 Jun 2013, Michal Hocko wrote:

> > I'm not sure a userspace oom notifier would want to keep a
> > preallocated buffer around that is mlocked in memory for all possible
> > lengths of this file.
>
> Well, an oom handler which allocates memory under the same restricted
> memory doesn't make much sense to me. Tracking all kmem allocations
> makes it almost impossible to implement a non-trivial handler.
>

This isn't only about oom notifiers that reside in the same oom memcg;
they can be at a higher level or in a different subtree entirely. The
point is that they can be unresponsive either because userspace is flaky,
because it's oom itself, or because we do track all slab. The kernel is
the only thing in a position to fix the issue after a sensible
user-defined amount of time has elapsed, and that's what this patch is.

> OK, maybe I just wasn't clear enough or I am missing your point. Your
> users _can_ implement and register their oom handlers. But as your
> requirements are rather benevolent for handlers implementation you would
> have a global watchdog which would sit on the oom_control of those
> groups (which are allowed to have own handlers - all of them in your
> case I guess) and trigger (user defined/global) timeout when it gets a
> notification. If the group was under oom always during the timeout then
> just disable oom_control until oom is settled (under_oom is 0).
>
> Why wouldn't something like this work for your use case?
>

For the aforementioned reason that we give users the ability to manipulate
their own memcg trees and userspace is untrusted. Their oom notifiers
cannot be run as root, not only because of security but also because it
would not properly isolate their memory usage to their memcg tree.

> Hohmm, so you are insisting on something that can be implemented in the
> userspace and put it into the kernel because it is more convenient for
> you and your use case. This doesn't sound like a way for accepting a
> feature.
>

I don't think you yet understand the problem, which is probably my fault.
I'm not insisting this be implemented in the kernel, I'm saying it's not
possible to do it in userspace. Your idea of a timeout implemented in
userspace doesn't work in practice: userspace is untrusted, and it cannot
be guaranteed to be perfect and to always wake up, gather the information
its implementation needs, and issue a SIGKILL.
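
To make the objection concrete, here is a minimal sketch of the decision
logic such a userspace timeout would need. It is only an illustration under
stated assumptions, not code from the patch or from any existing handler: it
assumes memory.oom_control reads back in the cgroup v1 form
"oom_kill_disable N" / "under_oom N", and the names parse_under_oom(),
struct oom_watch and oom_watch_update() are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse the "under_oom <0|1>" line out of the text read from a memcg's
 * memory.oom_control file (cgroup v1 format assumed).
 * Returns 1 or 0, or -1 if the line is missing or malformed.
 */
static int parse_under_oom(const char *text)
{
	const char *p = strstr(text, "under_oom ");
	int val;

	if (!p || sscanf(p, "under_oom %d", &val) != 1)
		return -1;
	return val != 0;
}

/* Hypothetical per-memcg watchdog state kept by the userspace daemon. */
struct oom_watch {
	long long oom_since_ms;	/* time oom was first seen, -1 if not oom */
	long long timeout_ms;	/* user-defined grace period */
};

/*
 * Feed one sample into the watchdog.  Returns 1 when the memcg has been
 * continuously under oom for at least timeout_ms, i.e. when the daemon
 * should step in (send SIGKILL or re-enable the oom killer), else 0.
 */
static int oom_watch_update(struct oom_watch *w, int under_oom,
			    long long now_ms)
{
	if (under_oom <= 0) {
		/* oom settled, or state unreadable: restart the clock */
		w->oom_since_ms = -1;
		return 0;
	}
	if (w->oom_since_ms < 0)
		w->oom_since_ms = now_ms;
	return now_ms - w->oom_since_ms >= w->timeout_ms;
}
```

Note that the timeout only "fires" when oom_watch_update() is actually
called. If the process running it is never scheduled, is oom itself, or is
simply buggy, nothing happens and the memcg stays stuck.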

This is the result of memcg allowing users to disable the oom killer
entirely for a memcg, which is still ridiculous, because if the user
doesn't respond then you've wasted all that memory and cannot get it back
without admin intervention or a reboot. There are no other "features" in
the kernel that put such a responsibility on a userspace process such that
if it doesn't work then the entire memcg deadlocks forever without admin
intervention. We need a failsafe in the kernel.

Real users like this cannot run as root, and we cannot run in the root
memcg without charging that memory usage to the user's container for that
share of a global resource. Real users do have to tolerate buggy and
flaky userspace implementations that cannot be guaranteed to run or do
what they are supposed to do. It's too costly of a problem to not address
with a failsafe. I speak strictly from experience on this.

> And yes we should make memcg oom handling less deadlock prone and
> Johannes' work in this thread is a good step forward.

The long-term solution to that, which I already have patches for, is
something you would cringe even more at: memcg memory reserves that are
shared with per-zone memory reserves that get the global oom killer to
kill off that process without notification in the case the memcg memory
reserves cause a global oom.