Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree

From: Johannes Weiner
Date: Wed Nov 27 2013 - 22:13:44 EST


On Wed, Nov 27, 2013 at 06:38:31PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
>
> > > The task that is bypassing the memcg charge to the root memcg may not be
> > > the process that is chosen by the oom killer, and it's possible the amount
> > > of memory freed by killing the victim is less than the amount of memory
> > > bypassed.
> >
> > That's true, though unlikely.
> >
>
> Well, the "goto bypass" allows it and it's trivial to cause by
> manipulating /proc/pid/oom_score_adj values to prefer processes with very
> little rss. It will just continue looping and killing processes as they
> are forked and never cause the memcg to free memory below its limit. At
> least the "goto nomem" allows us to free some memory instead of leaking to
> the root memcg.

Yes, that's the better way of doing it, I'll send the patch. Thanks.

> > > Were you targeting these to 3.13 instead? If so, it would have already
> > > appeared in 3.13-rc1 anyway. Is it still a work in progress?
> >
> > I don't know how to answer this question.
> >
>
> It appears as though this work is being developed in Linus's tree rather
> than -mm, so I'm asking if we should consider backing some of it out for
> 3.14 instead.

The changes fix a deadlock problem. Are they creating problems that
are worse than deadlocks, that would justify their revert?

> > > Should we be checking mem_cgroup_margin() here to ensure
> > > task_in_memcg_oom() is still accurate and we haven't raced by freeing
> > > memory?
> >
> > We would have invoked the OOM killer long before this point prior to
> > my patches. There is a line we draw and from that point on we start
> > killing things. I tried to explain multiple times now that there is
> > no race-free OOM killing and I'm tired of it. Convince me otherwise
> > or stop repeating this nonsense.
> >
>
> In our internal kernel we call mem_cgroup_margin() with the order of the
> charge immediately prior to sending the SIGKILL to see if it's still
> needed even after selecting the victim. It makes the race smaller.
>
> It's obvious that after the SIGKILL is sent, either from the kernel or
> from userspace, that memory might subsequently be freed or another process
> might exit before the process killed could even wake up. There's nothing
> we can do about that since we don't have psychic abilities. I think we
> should try to reduce the chance for unnecessary oom killing as much as
> possible, however.

Since we can't physically draw a perfect line, we should strive for a
reasonable and intuitive line. After that it's rapidly diminishing
returns. Killing something after that much reclaim effort without
success is a completely reasonable and intuitive line to draw. It's
also the line that was drawn a long time ago, and we're not
breaking this because of a micro-optimization.