Re: [PATCH 1/2][BUGFIX] oom: remove totalpage normalization fromoom_badness()

From: David Rientjes
Date: Tue Sep 07 2010 - 23:13:20 EST


On Wed, 8 Sep 2010, KOSAKI Motohiro wrote:

> Of cource, there is simply just zero justification. Who ask google usage?
> Every developer have their own debugging and machine administate patches.
> Don't you concern why we don't push them into upstream? only one usage is
> not good reason to make kernel bloat. Need to generalize.
>

/proc/pid/oom_adj was introduced before cpusets or memcg, so we were
dealing with a static amount of system resources that would not change out
from underneath an application. Although that's not a defense of using a
bitshift on a heuristic that included many arbitrary selections, it was
reasonable to make it only a scalar that didn't consider the amount of
resources that an application was allowed to access.

As time moved on, cgroups such as cpusets and memcg were introduced (and
mempolicies became much more popular as larger NUMA machines became more
popular in the industry) that bound or restricted the amount of memory
that an aggregate of tasks could access. Each of those methods may change
the amount of memory resources that an application has available to it at
any time without knowledge of that application, job scheduler, or system
daemon. Therefore, it's important to define a more powerful oom_adj
mechanism that can properly attribute the oom killing priority of a task
in comparison to others that are bound to the same resources without
causing a regression on users who don't use those mechanisms that allow
that working set size to be dynamic. That's what oom_score_adj does.

> > We are certainly looking forward to using this when 2.6.36 is released
> > since we work with both cpusets and memcg.
>
> Could you please be serious? We are not making a sand castle, we are making
> a kernel. You have to understand the difference of them. Zero user feature
> trial don't have any good reason to make userland breakage.
>

With respect to memory isolation and the oom killer, we want to follow the
upstream behavior as much as possible. We are looking forward to this
interface being available in 2.6.36 and will begin to actively use it once
it is released. So although there is no existing user today, there will
be when 2.6.36 is released. I also hope that other users of cpusets or
memcg will find it helpful to define oom killing priority with a unit
rather than a bitshift and that understands the dynamic nature of resource
isolation that they both imply.

> > Basically, it comes down to this: few users actually tune their oom
> > killing priority, period. That's partly because they accept the oom
> > killer's heuristics to kill a memory-hogging task or use panic_on_oom, or
> > because the old interface, /proc/pid/oom_adj, had no unit and no logical
> > way of using it other than polarizing it (either +15 or -17).
>
> Unrelated. We already have oom notifier. we have no reason to add new knob
> even though oom_adj is suck. We are not necessary to break userland application.
>

There is no generic oom notifier solution other than what is implemented
in the memory controller, and that comes at a cost of roughly 1% of system
RAM since that's the amount of metadata that the memcg requires.
oom_score_adj works for cpusets, memcg, and mempolicies.

Now you may insist that the fact that very few users actually use
/proc/pid/oom_adj for _anything_ other than -17, -15 or +15 is unrelated,
but in fact it provides a very nice context for your objections that it
introduces all sorts of regressions and bugs that will break everybody's
system. Show me a single example of an application that tunes its oom_adj
value based on any formula whatsoever that includes its own anticipated
RAM usage or system capacity (which is required to make the bitshift make
any sense).

> More importantly, We already have oom notifier. and It is most powerful
> infrastructure. It can be constructed any oom policy freely. It's not
> restricted kernel implementaion.
>

A generic oom notifier would be very nice to have in the kernel that
isn't dependent on memory controller. Unfortunately, many users cannot
incur the 1% of memory penalty that it comes with.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/