Re: [PATCH V3] memcg, oom: provide more precise dump info while memcg oom happening

From: Sha Zhengju
Date: Fri Nov 09 2012 - 05:22:58 EST


On 11/09/2012 12:25 AM, Michal Hocko wrote:
On Thu 08-11-12 23:52:47, Sha Zhengju wrote:
[...]
(2) After change
[ 269.225628] mal invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[ 269.225633] mal cpuset=/ mems_allowed=0-1
[ 269.225636] Pid: 4616, comm: mal Not tainted 3.6.0+ #25
[ 269.225637] Call Trace:
[ 269.225647] [<ffffffff8111b9c4>] dump_header+0x84/0xd0
[ 269.225650] [<ffffffff8111c691>] oom_kill_process+0x331/0x350
[ 269.225710] .......(call trace)
[ 269.225713] [<ffffffff81517325>] page_fault+0x25/0x30
[ 269.225716] Task in /1/2 killed as a result of limit of /1
[ 269.225718] memory: usage 511732kB, limit 512000kB, failcnt 5071
[ 269.225720] memory+swap: usage 563200kB, limit 563200kB, failcnt 57
[ 269.225721] kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
[ 269.225722] Memory cgroup stats:cache:8KB rss:511724KB mapped_file:4KB swap:51468KB inactive_anon:265864KB active_anon:245832KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 269.225741] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 269.225757] [ 4554] 0 4554 16626 473 17 25 0 bash
[ 269.225759] [ 4611] 0 4611 103328 90231 208 12260 0 mal
[ 269.225762] [ 4616] 0 4616 103328 32799 88 7562 0 mal
[ 269.225764] Memory cgroup out of memory: Kill process 4611 (mal) score 699 or sacrifice child
[ 269.225766] Killed process 4611 (mal) total-vm:413312kB, anon-rss:360632kB, file-rss:292kB

This version provides more precise info for the memcg in the "Memory cgroup stats" section.
Looks much better!

Change log:
v3 <-- v2:
1. fix the hierarchy handling
2. undo the dump_tasks rework
v2 <-- v1:
1. some modifications for hierarchy support
2. rework dump_tasks
3. rebased on Michal's mm tree, branch since-3.6

Signed-off-by: Sha Zhengju <handai.szj@xxxxxxxxxx>
---
mm/memcontrol.c | 41 +++++++++++++++++++++++++++++++----------
mm/oom_kill.c | 6 ++++--
2 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0eab7d5..17317fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
[...]
@@ -1501,8 +1509,8 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
spin_unlock_irqrestore(&memcg->move_lock, *flags);
}

+#define K(x) ((x) << (PAGE_SHIFT-10))
/**
- * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
No need to remove this, just fix it:
* mem_cgroup_print_oom_info: Print OOM information relevant to memory controller.

* @memcg: The memory cgroup that went over limit
* @p: Task that is going to be killed
*
@@ -1520,8 +1528,10 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
*/
static char memcg_name[PATH_MAX];
int ret;
+ struct mem_cgroup *mi;
+ unsigned int i;

- if (!memcg || !p)
+ if (!p)
return;

rcu_read_lock();
@@ -1569,6 +1579,25 @@ done:
res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
+
+ printk(KERN_INFO "Memory cgroup stats:");
"Memory cgroup hierarchy stats" is probably a better fit with the
current implementation.

+ for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
+ long long val = 0;
+ if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+ continue;
+ for_each_mem_cgroup_tree(mi, memcg)
+ val += mem_cgroup_read_stat(mi, i);
+ printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i], K(val));
+ }
+
+ for (i = 0; i < NR_LRU_LISTS; i++) {
+ unsigned long long val = 0;
+
+ for_each_mem_cgroup_tree(mi, memcg)
+ val += mem_cgroup_nr_lru_pages(mi, BIT(i));
+ printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i], K(val));
+ }
+ printk(KERN_CONT "\n");
This is nice and simple. I am just thinking whether it is enough. Say
that you have a deeper hierarchy and there is a safety limit at its
root:
A (limit)
/|\
B C D
|\
E F

and we trigger an OOM on A's limit. Now we know that something blew
up, but we do not know what it was. Wouldn't it be better to swap the for
and for_each_mem_cgroup_tree loops? Then we would see the whole
hierarchy and could potentially point at the group which doesn't behave:
Memory cgroup stats for A/: ...
Memory cgroup stats for A/B/: ...
Memory cgroup stats for A/C/: ...
Memory cgroup stats for A/D/: ...
Memory cgroup stats for A/D/E/: ...
Memory cgroup stats for A/D/F/: ...
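
Roughly, swapping the loops might look like the following untested
sketch. It reuses only helpers already visible in the patch, plus
cgroup_path() and the static memcg_name buffer used earlier in
mem_cgroup_print_oom_info():

	for_each_mem_cgroup_tree(mi, memcg) {
		/* print this group's path so a misbehaving group is identifiable */
		rcu_read_lock();
		ret = cgroup_path(mi->css.cgroup, memcg_name, PATH_MAX);
		rcu_read_unlock();
		printk(KERN_INFO "Memory cgroup stats for %s:",
		       !ret ? memcg_name : "?");

		for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
			if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
				continue;
			printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i],
			       K((long long)mem_cgroup_read_stat(mi, i)));
		}

		for (i = 0; i < NR_LRU_LISTS; i++)
			printk(KERN_CONT "%s:%lluKB ", mem_cgroup_lru_names[i],
			       (unsigned long long)K(mem_cgroup_nr_lru_pages(mi, BIT(i))));

		printk(KERN_CONT "\n");
	}

The per-group inner loops stay as in the patch; only the tree walk
moves to the outside.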

Would it still fit in with your use case?
[...]

We haven't used such a complicated hierarchy yet, but it sounds like a good suggestion. :)
From our experience, hierarchy is a little complex to use, and the three cgroups involved in a memcg OOM can all be different: the memcg of the invoker, the memcg of the killed task, and the memcg that went over its limit. Suppose a process in B triggers the OOM and a victim in root A is selected to be killed; we may well want to know the memcg stats local to the A cgroup (excluding B, C and D). So besides the hierarchy info, is it acceptable to also print the stats local to the root node, as I did in the V1 version (https://lkml.org/lkml/2012/7/30/179)?
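
(For illustration only, not necessarily the exact V1 code: printing the
local stats would amount to running the same stat loop against the
limit-hitting memcg itself, without the tree walk.)

	/* sketch: local stats of the memcg that hit its limit, no hierarchy walk */
	printk(KERN_INFO "Memory cgroup local stats:");
	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
			continue;
		printk(KERN_CONT "%s:%lldKB ", mem_cgroup_stat_names[i],
		       K((long long)mem_cgroup_read_stat(memcg, i)));
	}
	printk(KERN_CONT "\n");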

Another thing I'm hesitating over is NUMA stats; it seems the output is getting longer and longer....


Thanks,
Sha
