Re: [PATCH] mm/oom_kill: count global and memory cgroup oom kills

From: Konstantin Khlebnikov
Date: Tue May 23 2017 - 06:32:27 EST




On 23.05.2017 10:49, David Rientjes wrote:
On Mon, 22 May 2017, Konstantin Khlebnikov wrote:

Nope, they are different. I think we should rephrase the documentation somehow:

low - count of reclaims below low level
high - count of post-allocation reclaims above high level
max - count of direct reclaims
oom - count of failed direct reclaims
oom_kill - count of oom killer invocations and killed processes


In our kernel, we've maintained counts of oom kills per memcg for years as
part of memory.oom_control for memcg v1, but we've also found it helpful
to complement that with another count that specifies the number of
processes oom killed that were attached to that exact memcg.

In your patch, oom_kill in memory.oom_control specifies the number of oom
events that resulted in an oom kill of a process from that hierarchy, but
not the number of processes killed from a specific memcg (the difference
between oc->memcg and mem_cgroup_from_task(victim)). Not sure if you
would also find it helpful.


This is worth adding. Let's call it "oom_victim" for short.

It makes it possible to locate the leaky part when the leaks are spread over sub-containers within a common limit.
But it doesn't tell which limit caused the kill. For hierarchical limits this might not be so easy to work out.

I think oom_kill is better suited to automatic actions: restarting the affected hierarchy, increasing limits, etc.
But oom_victim makes it possible to determine which container was affected by the global oom killer.
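
To make the distinction concrete, here is a minimal userspace sketch of the two counters kept separately. It is a toy model, not kernel code: struct sketch_memcg and sketch_record_kill() are hypothetical names, and the model assumes no hierarchical propagation of counts.

```c
#include <stddef.h>

/* Toy userspace model of the two counters; all names are hypothetical. */
struct sketch_memcg {
	unsigned long oom_kill;		/* kills caused by this group's limit */
	unsigned long oom_victim;	/* processes killed while attached here */
};

/*
 * Record one kill.  limit_memcg is the group whose limit triggered the
 * OOM (NULL for a global OOM); victim_memcg is the group the killed
 * process was attached to.
 */
void sketch_record_kill(struct sketch_memcg *limit_memcg,
			struct sketch_memcg *victim_memcg)
{
	if (limit_memcg)
		limit_memcg->oom_kill++;
	victim_memcg->oom_victim++;
}
```

With a parent limit over two sub-containers, the parent's oom_kill says how often that limit fired, while each child's oom_victim says where the killed processes lived. A global OOM (limit_memcg == NULL) shows up only in oom_victim.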

So it's probably worth merging them and having the global killer increment oom_kill for the victim's memcg:

	if (!is_memcg_oom(oc)) {
		count_vm_event(OOM_KILL);
		mem_cgroup_count_vm_event(mm, OOM_KILL);
	} else
		mem_cgroup_event(oc->memcg, OOM_KILL);
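
The effect of the merged scheme can be checked with a small userspace model. This is only a sketch under stated assumptions: struct toy_memcg, toy_global_oom_kill and toy_count_oom_kill() are hypothetical stand-ins for the memcg, the OOM_KILL vm event and the branch above, and counts are not propagated up the hierarchy.

```c
#include <stddef.h>

/* Toy userspace model of the merged counter; all names are hypothetical. */
struct toy_memcg {
	const char *name;
	unsigned long oom_kill;	/* single merged counter */
};

/* Stands in for the system-wide OOM_KILL vm event. */
unsigned long toy_global_oom_kill;

/*
 * Mirror of the branch above.  limit_memcg is NULL for a global OOM,
 * otherwise it is the group whose limit was hit.
 */
void toy_count_oom_kill(struct toy_memcg *limit_memcg,
			struct toy_memcg *victim_memcg)
{
	if (!limit_memcg) {
		/* Global OOM: count it globally and against the victim's group. */
		toy_global_oom_kill++;
		victim_memcg->oom_kill++;
	} else {
		/* Memcg OOM: count it against the group whose limit fired. */
		limit_memcg->oom_kill++;
	}
}
```

In this model a memcg OOM does not touch the victim's own group unless that group is also the one whose limit was hit, so the per-victim information is only available for global OOMs.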