Re: [PATCH] Revert oom rewrite series

From: David Rientjes
Date: Tue Nov 16 2010 - 19:25:49 EST


On Wed, 17 Nov 2010, Bodo Eggert wrote:

> The old oom killer's task was to guess the best victim to kill. For me, it
> did a good job (but the system kept thrashing for too long until it kicked
> the offender). Looking at CAP_SYS_RESOURCE was one way to recognize
> important processes.
>

CAP_SYS_RESOURCE does not imply the task is important.

When the kernel is oom, killing a thread that is getting work done is one
of the most drastic remedies it will ever apply to allow forward progress.
In almost all scenarios (except in some cpuset or memcg configurations),
it's a userspace configuration issue that exhausts memory and the VM finds
no other alternative. CAP_SYS_RESOURCE threads have access to unbounded
amounts of resources and thus can use an extremely large amount of memory
very quickly, to the detriment of other threads that may be as important
or more important. Treating them differently is an unsubstantiated and
undefined behavior that should not be part of the heuristic _unless_ the
administrator or the task itself tells the kernel its priority via
oom_score_adj.
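
As a rough illustration of how a task or an administrator can state that
priority explicitly, here is a minimal Python sketch that writes a value to
/proc/<pid>/oom_score_adj. The helper name set_oom_score_adj and its
proc_root parameter are hypothetical and exist only for this example; the
-1000..1000 range is the documented range of the tunable.

```python
import os

# Documented bounds of the oom_score_adj tunable.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def set_oom_score_adj(value, pid="self", proc_root="/proc"):
    """Write `value` to <proc_root>/<pid>/oom_score_adj and return the path.

    Hypothetical helper for illustration. `proc_root` is a parameter only
    so the function can be exercised against a scratch directory.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be within [-1000, 1000]")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(str(value))
    return path
```

Calling set_oom_score_adj(-1000) would ask the kernel never to select the
calling task; note that lowering the value may require privilege.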

> > The old heuristics were a mixture of arbitrary values that didn't adjust
> > scores based on a unit and would often cause the incorrect task to be
> > targeted because there was no clear goal being achieved. The new
> > heuristic has a solid goal: to identify and kill the most memory-hogging
> > task that is eligible given the context in which the oom occurs. If you
> > disagree with that goal and want any of the old heuristics reintroduced,
> > please show that it makes sense in the oom killer.
>
> The first old OOM killer did the same as you promise the current one does,
> except for your bugfixes. That's why it killed the wrong applications and
> all the heuristics were added until the complaints stopped.
>

No, the old oom killer did not always kill the application that used the
most memory; it considered other factors with arbitrary point deductions
such as nice level, runtime, CAP_SYS_RAWIO, CAP_SYS_RESOURCE, etc. We had
to remove those heuristics internally in older kernels as well because
they would often allow a task to run away, using a massive amount of
memory because of leaks, and kill everything else on the system before
targeting the appropriate task. At that point, it left the system with
barely anything running and no work was getting done.
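
To make that failure mode concrete, here is a toy model in Python. It is
not the old kernel code: the task list, the field names, and the
divide-by-four deduction are assumptions standing in for the kind of
capability-based discount described above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    mem_mb: int
    cap_sys_resource: bool = False


def old_style_points(task):
    """Memory use with an arbitrary capability discount (illustrative)."""
    points = task.mem_mb
    if task.cap_sys_resource:
        points //= 4  # assumed stand-in for the old point deductions
    return points


def memory_only_points(task):
    """Score a task purely by how much memory it uses."""
    return task.mem_mb


def pick_victim(tasks, score):
    """Return the task with the highest score under `score`."""
    return max(tasks, key=score)


tasks = [
    Task("leaker", 6000, cap_sys_resource=True),  # runaway, privileged
    Task("database", 2000),
    Task("shell", 50),
]

old_victim = pick_victim(tasks, old_style_points)
new_victim = pick_victim(tasks, memory_only_points)
```

With the discount, the leaker scores 1500 and the database, at 2000, is
chosen first even though it uses a third of the memory; scoring by memory
alone picks the leaker.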

> Of course I did not yet test your OOM killer, maybe it really is better.
> Heuristics tend to rot and you did much work to make it right.
>
> I don't want the old OOM killer back, but I don't want you to fall
> into the same pits as the pre-old OOM killer used to do.
>

Thanks, and that's why I'm trying to avoid additional heuristics such as
CAP_SYS_RESOURCE where the priority is _implied_ rather than _proven_. If
CAP_SYS_RESOURCE were defined to mean a task is preferred to stay alive,
then I'd have no argument; it isn't.

> > > PS) Mapping an exponential value to a linear score is bad. E.g. A
> > > oom_adj of 8 should make an 1-MB-process as likely to kill as
> > > a 256-MB-process with oom_adj=0.
> > >
> >
> > To show that, you would have to show that an application that exists today
> > uses an oom_adj for something other than polarization and is based on a
> > calculation of allowable memory usage. It simply doesn't exist.
>
> No such application should exist because the OOM killer should DTRT.
> oom_adj was supposed to let the sysadmin lower his mission-critical
> DB's score to be just lower than the less-important tasks, or to
> point the kernel to his ever-faulty and easily-restarted browser.
>

oom_score_adj allows us to define when an application is using more
memory than expected and is often helpful in cpuset, memcg, or mempolicy
constrained cases as well. We'd like to be able to say that 30% of
available memory should be discounted from a particular task that is
expected to use 30% more memory than others, so that it is not preferred
as a victim for that reason alone. oom_score_adj can do that, oom_adj
could not.
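
The arithmetic can be sketched as follows. This is a simplified Python
model under stated assumptions, not the kernel source: the function names
are hypothetical, the score is taken to be the task's share of available
memory in thousandths plus oom_score_adj, and other adjustments are left
out. The second function models the old oom_adj as a bitshift, which is
what makes an oom_adj of 8 worth a factor of 256.

```python
def oom_score(rss_pages, swap_pages, totalpages, oom_score_adj=0):
    """Simplified score on a 0..1000 scale (assumed model).

    Each point is 0.1% of available memory, so oom_score_adj is expressed
    in the same unit as the score itself.
    """
    if oom_score_adj == -1000:
        return 0  # task is exempt from selection
    points = (rss_pages + swap_pages) * 1000 // totalpages
    points += oom_score_adj
    return max(0, min(points, 1000))


def old_oom_adj_points(points, oom_adj):
    """Model of the old oom_adj: scale points by a power of two."""
    if oom_adj > 0:
        return points << oom_adj
    return points >> -oom_adj


TOTAL = 1_000_000  # pages available in the oom context (example value)

# A task expected to use 30% more memory gets a 30% discount (-300).
big_but_expected = oom_score(500_000, 0, TOTAL, oom_score_adj=-300)
ordinary = oom_score(250_000, 0, TOTAL)
```

Here the discounted task scores 200 against 250 for the ordinary one, so
it is not preferred despite using twice the memory. With the old
interface, old_oom_adj_points(1, 8) gives 256, which cannot express "30%
of available memory" at all.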