Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patchremoved from -mm tree

From: David Rientjes
Date: Wed Nov 27 2013 - 20:04:14 EST


On Wed, 27 Nov 2013, Johannes Weiner wrote:

> > The memcg oom killer has incurred a serious regression since the 3.12-rc6
> > kernel where this was merged as 4942642080ea ("mm: memcg: handle non-error
> > OOM situations more gracefully"). It cc'd stable@xxxxxxxxxx although it
> > doesn't appear to have been picked up yet, and I'm hoping that we can
> > avoid having it merged in a stable kernel until we get this fixed.
> >
> > This patch, specifically the above, allows memcgs to bypass their limits
> > by charging the root memcg in oom conditions.
> >
> > If I create a memcg, cg1, with memory.limit_in_bytes == 128MB and start a
> > memory allocator in it to induce oom, the memcg limit is trivially broken:
> >
> > membench invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> > membench cpuset=/ mems_allowed=0-3
> > CPU: 9 PID: 11388 Comm: membench Not tainted 3.13-rc1
> > ffffc90015ec6000 ffff880671c3dc18 ffffffff8154a1e3 0000000000000007
> > ffff880674c215d0 ffff880671c3dc98 ffffffff81548b45 ffff880671c3dc58
> > ffffffff81151de7 0000000000000001 0000000000000292 ffff880800000000
> > Call Trace:
> > [<ffffffff8154a1e3>] dump_stack+0x46/0x58
> > [<ffffffff81548b45>] dump_header+0x7a/0x1bb
> > [<ffffffff81151de7>] ? find_lock_task_mm+0x27/0x70
> > [<ffffffff812e8b76>] ? ___ratelimit+0x96/0x110
> > [<ffffffff811521c4>] oom_kill_process+0x1c4/0x330
> > [<ffffffff81099ee5>] ? has_ns_capability_noaudit+0x15/0x20
> > [<ffffffff811a916a>] mem_cgroup_oom_synchronize+0x50a/0x570
> > [<ffffffff811a8660>] ? __mem_cgroup_try_charge_swapin+0x70/0x70
> > [<ffffffff81152968>] pagefault_out_of_memory+0x18/0x90
> > [<ffffffff81547b86>] mm_fault_error+0xb7/0x15a
> > [<ffffffff81553efb>] __do_page_fault+0x42b/0x500
> > [<ffffffff810c667d>] ? set_next_entity+0xad/0xd0
> > [<ffffffff810c670b>] ? pick_next_task_fair+0x6b/0x170
> > [<ffffffff8154d08e>] ? __schedule+0x38e/0x780
> > [<ffffffff81553fde>] do_page_fault+0xe/0x10
> > [<ffffffff815509e2>] page_fault+0x22/0x30
> > Task in /cg1 killed as a result of limit of /cg1
> > memory: usage 131072kB, limit 131072kB, failcnt 1053
> > memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
> > kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
> > Memory cgroup stats for /cg1: cache:84KB rss:130988KB rss_huge:116736KB mapped_file:72KB writeback:0KB inactive_anon:0KB active_anon:130976KB inactive_file:4KB active_file:0KB unevictable:0KB
> > [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
> > [ 7761] 0 7761 1106 483 5 0 0 bash
> > [11388] 99 11388 270773 33031 83 0 0 membench
> > Memory cgroup out of memory: Kill process 11388 (membench) score 1010 or sacrifice child
> > Killed process 11388 (membench) total-vm:1083092kB, anon-rss:130824kB, file-rss:1300kB
> >
> > The score of 1010 shown for pid 11388 (membench) should never happen in
> > the oom killer, the maximum value should always be 1000 in any oom
> > context. This indicates that the process has allocated more memory than
> > is available to the memcg. The rss value, 33031 pages, shows that it has
> > allocated >129MB of memory in a memcg limited to 128MB.
> >
> > The entire premise of memcg is to prevent processes attached to it to not
> > be able to allocate more memory than allowed and this trivially breaks
> > that premise in oom conditions.
>
> We already allow a task to allocate beyond the limit if it's selected
> by the OOM killer, so that it can exit faster.
>
> My patch added that a task can bypass the limit when it decided to
> trigger the OOM killer, so that it can get to the OOM kill faster.
>

The task that is bypassing the memcg charge to the root memcg may not be
the process that is chosen by the oom killer, and it's possible the amount
of memory freed by killing the victim is less than the amount of memory
bypassed.

> So I don't think my patch has broken "the entire premise of memcgs".
> At the same time, it also does not really rely on that bypass, we
> should be able to remove it.
>
> This patch series was not supposed to go into the last merge window, I
> already told stable to hold off on these until further notice.
>

Were you targeting these to 3.13 instead? If so, it would have already
appeared in 3.13-rc1 anyway. Is it still a work in progress?

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 13b9d0f..5f9e467 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2675,7 +2675,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> goto bypass;
>
> if (unlikely(task_in_memcg_oom(current)))
> - goto bypass;
> + goto nomem;
>
> /*
> * We always charge the cgroup the mm_struct belongs to.

Is there any benefit to doing this over just schedule_timeout_killable()
since we need to wait for mem_cgroup_oom_synchronize() to be able to make
forward progress at this point?

Should we be checking mem_cgroup_margin() here to ensure
task_in_memcg_oom() is still accurate and we haven't raced by freeing
memory?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/