Re: memory-cgroup bug

From: Michal Hocko
Date: Sun Nov 25 2012 - 08:55:49 EST


On Sun 25-11-12 13:05:24, Michal Hocko wrote:
> [Adding Kamezawa into CC]
>
> On Sun 25-11-12 01:10:47, azurIt wrote:
> > >Could you take a few snapshots over time?
> >
> >
> > Here it is, this time from a different server. A snapshot was taken
> > every second for 10 minutes (I hope that's enough):
> > www.watchdog.sk/lkml/memcg-bug-2.tar.gz
>
> Hmm, interesting:
> $ grep . */memory.failcnt | cut -d: -f2 | awk 'BEGIN{min=666666}{if (prev>0) {diff=$1-prev; if (diff>max) max=diff; if (diff<min) min=diff; sum+=diff; n++} prev=$1}END{printf "min:%d max:%d avg:%f\n", min, max, sum/n}'
> min:16281 max:224048 avg:18818.943119
>
> So there are a lot of allocation attempts that fail, every second!
> Will get to that later.
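
For readers who find the awk one-liner above hard to follow, here is a
minimal sketch of the same per-snapshot difference computation in C. The
names delta_stats and counter_deltas are hypothetical and exist only for
this illustration. Unlike the awk version, it seeds min and max from the
first difference instead of a magic constant, and it does not skip
samples whose previous value is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of the increases between consecutive counter samples. */
struct delta_stats {
	long min;
	long max;
	double avg;
	size_t n;	/* number of differences, i.e. samples - 1 */
};

/*
 * Compute min, max and average of samples[i] - samples[i - 1].
 * With fewer than two samples there are no differences and all
 * fields stay zero.
 */
static struct delta_stats counter_deltas(const long *samples, size_t count)
{
	struct delta_stats s = { 0, 0, 0.0, 0 };
	double sum = 0.0;

	assert(samples != NULL || count == 0);

	for (size_t i = 1; i < count; i++) {
		long diff = samples[i] - samples[i - 1];

		if (s.n == 0 || diff < s.min)
			s.min = diff;
		if (s.n == 0 || diff > s.max)
			s.max = diff;
		sum += (double)diff;
		s.n++;
	}
	if (s.n > 0)
		s.avg = sum / (double)s.n;
	return s;
}
```

Feeding it the memory.failcnt value from each one-second snapshot, in
order, gives the number of failed charges per second.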
>
> The number of tasks in the group is stable (20):
> $ for i in *; do ls -d1 $i/[0-9]* | wc -l; done | sort | uniq -c
> 546 20
>
> And no task has been killed or spawned:
> $ for i in *; do ls -d1 $i/[0-9]* | cut -d/ -f2; done | sort | uniq
> 24495
> 24762
> 24774
> 24796
> 24798
> 24805
> 24813
> 24827
> 24831
> 24841
> 24842
> 24863
> 24892
> 24924
> 24931
> 25130
> 25131
> 25192
> 25193
> 25243
>
> $ for stack in [0-9]*/[0-9]*
> do
> head -n1 $stack/stack
> done | sort | uniq -c
> 9841 [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> 546 [<ffffffff811109b8>] do_truncate+0x58/0xa0
> 533 [<ffffffffffffffff>] 0xffffffffffffffff
>
> This tells us that the stacks are pretty much stable.
> $ grep do_truncate -r [0-9]* | cut -d/ -f2 | sort | uniq -c
> 546 24495
>
> So 24495 is stuck in do_truncate
> [<ffffffff811109b8>] do_truncate+0x58/0xa0
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> I suspect it is waiting for i_mutex. Who is holding that lock?
> The other tasks are blocked in mem_cgroup_handle_oom. They come either
> from the page fault path, so i_mutex can be excluded, or from vfs_write
> (24796), and that one is interesting:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes &inode->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> This smells like a deadlock, but a rather strange one. The rapidly
> increasing failcnt suggests that somebody is still trying to allocate,
> but who, when all of them are hung in mem_cgroup_handle_oom? This can
> be explained, though.
> The memcg OOM killer lets only one process (the one which is able to
> lock the hierarchy by mem_cgroup_oom_lock) call
> mem_cgroup_out_of_memory and kill a process, while the others wait on
> the wait queue. Once the killer is done, it calls memcg_wakeup_oom,
> which wakes up the other tasks waiting on the queue. Those retry the
> charge in the hope that some memory has been freed in the meantime.
> It has not, so they get into OOM again (and again and again).
> This all usually works out, except that in this particular case I would
> bet my hat that the task selected by the OOM killer is pid 24495, which
> is blocked on the mutex held by one of the tasks sitting in the OOM
> path, so it cannot finish and thus cannot free any memory.
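
The wake-up and retry cycle described above can be modelled with a small
userspace toy. This is only a sketch under simplifying assumptions, not
kernel code: every name in it (toy_group, toy_try_charge, toy_oom_round,
toy_run) is hypothetical, memory is counted in abstract pages, and the
blocked victim is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one memory cgroup. Not a kernel structure. */
struct toy_group {
	long usage;		/* pages currently charged */
	long limit;		/* hard limit in pages */
	long failcnt;		/* failed charge attempts so far */
	bool victim_blocked;	/* OOM victim waits for a lock held by a charger */
	long victim_pages;	/* pages the victim would free by exiting */
};

/* One charge attempt: succeeds below the limit, otherwise counts a failure. */
static bool toy_try_charge(struct toy_group *g)
{
	if (g->usage < g->limit) {
		g->usage++;
		return true;
	}
	g->failcnt++;
	return false;
}

/* One OOM round: the victim frees its memory only if it is able to exit. */
static void toy_oom_round(struct toy_group *g)
{
	assert(g->usage >= g->victim_pages);
	if (!g->victim_blocked) {
		g->usage -= g->victim_pages;
		g->victim_pages = 0;
	}
}

/*
 * Run `rounds` kill/wake-up cycles in which each of `waiters` tasks
 * retries its charge once. Returns the number of successful charges.
 */
static long toy_run(struct toy_group *g, int waiters, int rounds)
{
	long charged = 0;

	for (int r = 0; r < rounds; r++) {
		toy_oom_round(g);
		for (int t = 0; t < waiters; t++)
			if (toy_try_charge(g))
				charged++;
	}
	return charged;
}
```

With victim_blocked set, usage never drops, no charge ever succeeds and
failcnt grows by `waiters` on every round, which is the shape of the
steadily climbing memory.failcnt seen in the snapshots. With the flag
clear, the first round frees the victim's pages and the retries make
progress.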
>
> It seems that Linus' current tree is affected as well.
>
> I will have to think about a solution but it sounds really tricky. It is
> not just ext3 that is affected.
>
> I guess we need to tell mem_cgroup_cache_charge that it should never
> reach OOM from add_to_page_cache_locked. This sounds quite intrusive to
> me. On the other hand, it is really weird that an excessive writer
> might trigger the memcg OOM killer.

This is hackish, but it should help you in this case. Kamezawa, what do
you think about that? Should we generalize this and prepare something
like mem_cgroup_cache_charge_locked, which would add __GFP_NORETRY
automatically, and use that function whenever we are in a locked
context? To be honest, I do not like this very much, but nothing more
sensible (without touching non-memcg paths) comes to my mind.
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..da50c83 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -448,7 +448,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(PageSwapBacked(page));
 
 	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+				(gfp_mask | __GFP_NORETRY) & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
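
To make the flag arithmetic in the hunk above concrete, here is a small
userspace sketch. It assumes that GFP_RECLAIM_MASK includes
__GFP_NORETRY, which the patch relies on. The TOY_* bit values and both
helper names are made up for this illustration and do not match the
kernel headers.

```c
#include <assert.h>

/* Stand-in flag bits; the values are illustrative, not the kernel's. */
typedef unsigned int toy_gfp_t;

#define TOY_GFP_WAIT	0x01u
#define TOY_GFP_IO	0x02u
#define TOY_GFP_FS	0x04u
#define TOY_GFP_NORETRY	0x08u
#define TOY_GFP_HIGHMEM	0x10u	/* zone modifier, outside the reclaim mask */

#define TOY_GFP_RECLAIM_MASK \
	(TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS | TOY_GFP_NORETRY)

/* Assumption of this sketch: the mask must not strip the NORETRY bit. */
static_assert((TOY_GFP_RECLAIM_MASK & TOY_GFP_NORETRY) != 0,
	      "reclaim mask must keep NORETRY");

/* Mask passed to the charge before the patch. */
static toy_gfp_t toy_charge_mask(toy_gfp_t gfp_mask)
{
	return gfp_mask & TOY_GFP_RECLAIM_MASK;
}

/* Mask passed to the charge after the patch: NORETRY is always set. */
static toy_gfp_t toy_charge_mask_locked(toy_gfp_t gfp_mask)
{
	return (gfp_mask | TOY_GFP_NORETRY) & TOY_GFP_RECLAIM_MASK;
}
```

A generalized mem_cgroup_cache_charge_locked would presumably wrap the
real charge function around the second form, so that a charge attempted
with a lock held fails instead of entering the memcg OOM path.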

--
Michal Hocko
SUSE Labs