Re: [patch 13/13] mm: memcontrol: rewrite uncharge API

From: Naoya Horiguchi
Date: Tue Jul 15 2014 - 16:50:28 EST


On Tue, Jul 15, 2014 at 03:04:54PM -0400, Johannes Weiner wrote:
> On Tue, Jul 15, 2014 at 02:43:58PM -0400, Naoya Horiguchi wrote:
> > On Tue, Jul 15, 2014 at 01:34:39PM -0400, Johannes Weiner wrote:
> > > On Tue, Jul 15, 2014 at 06:07:35PM +0200, Michal Hocko wrote:
> > > > On Tue 15-07-14 11:55:37, Naoya Horiguchi wrote:
> > > > > On Wed, Jun 18, 2014 at 04:40:45PM -0400, Johannes Weiner wrote:
> > > > > ...
> > > > > > diff --git a/mm/swap.c b/mm/swap.c
> > > > > > index a98f48626359..3074210f245d 100644
> > > > > > --- a/mm/swap.c
> > > > > > +++ b/mm/swap.c
> > > > > > @@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
> > > > > > del_page_from_lru_list(page, lruvec, page_off_lru(page));
> > > > > > spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > > > > }
> > > > > > + mem_cgroup_uncharge(page);
> > > > > > }
> > > > > >
> > > > > > static void __put_single_page(struct page *page)
> > > > >
> > > > > This seems to cause a list breakage in hstate->hugepage_activelist
> > > > > when freeing a hugetlbfs page.
> > > >
> > > > This looks like a fall out from
> > > > http://marc.info/?l=linux-mm&m=140475936311294&w=2
> > > >
> > > > I didn't get to review this one but the easiest fix seems to be check
> > > > HugePage and do not call uncharge.
> > >
> > > Yes, that makes sense. I'm also moving the uncharge call into
> > > __put_single_page() and __put_compound_page() so that PageHuge(), a
> > > function call, only needs to be checked for compound pages.
> > >
> > > > > For hugetlbfs, we uncharge in free_huge_page() which is called after
> > > > > __page_cache_release(), so I think that we don't have to uncharge here.
> > > > >
> > > > > In my testing, moving mem_cgroup_uncharge() inside if (PageLRU) block
> > > > > fixed the problem, so if that works for you, could you fold the change
> > > > > into your patch?
> > >
> > > Memcg pages that *do* need uncharging might not necessarily be on the
> > > LRU list.
> >
> > OK.
> >
> > > Does the following work for you?
> >
> > Unfortunately, with this change I saw the following bug message when
> > stressing with hugepage migration.
> > move_to_new_page() is called by unmap_and_move_huge_page() too, so
> > we need some hugetlb related code around mem_cgroup_migrate().
>
> Can we just move hugetlb_cgroup_migrate() into move_to_new_page()? It
> doesn't seem to be dependent of any page-specific state.
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7f5a42403fae..219da52d2f43 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -781,7 +781,10 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> if (!PageAnon(newpage))
> newpage->mapping = NULL;
> } else {
> - mem_cgroup_migrate(page, newpage, false);
> + if (PageHuge(page))
> + hugetlb_cgroup_migrate(hpage, new_hpage);

hugetlb_cgroup_migrate(page, newpage);

to build successfully.

And yes, with this chanage the bug in move_to_new_page() is gone,
so we stepped one step further.

But I faced another bugs like below.

[ 56.692744] BUG: Bad page state in process sysctl pfn:71c00
[ 56.693722] page:ffffea0001c70000 count:0 mapcount:0 mapping: (null) index:0x8
[ 56.695121] page flags: 0x5fffff80004008(uptodate|head)
[ 56.695990] page dumped because: cgroup check failed
[ 56.696816] pc:ffff88007eb9c000 pc->flags:7 pc->mem_cgroup:ffff8800be59a800
[ 56.698059] Modules linked in: stap_6484a34ef9f0ebb4400874c66d0905ac__1496(O) bnep bluetooth ip6t_rpfilter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 cfg80211 xt_conntrack rfk
ill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_def
rag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ppdev microcode parport_pc serio_raw parport virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss o
id_registry nfs_acl lockd sunrpc virtio_blk virtio_net ata_generic pata_acpi floppy
[ 56.707416] CPU: 2 PID: 1872 Comm: sysctl Tainted: G B O 3.15.0-140715-1512-00017-gf1ab1502aa49 #264
[ 56.709024] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 56.709810] ffffffff81a8e0d5 ffff88003d787cb0 ffffffff8172d057 ffff88003d787cc8
[ 56.711158] ffffffff8172d08e ffffea0001c70000 ffff88003d787cf0 ffffffff8119e7a5
[ 56.712344] 0000000000000000 000fffff80000000 ffffffff81a8e0d5 ffff88003d787d28
[ 56.713551] Call Trace:
[ 56.714088] [<ffffffff8172d057>] __dump_stack+0x19/0x1b
[ 56.714793] [<ffffffff8172d08e>] dump_stack+0x35/0x46
[ 56.715546] [<ffffffff8119e7a5>] bad_page+0xd5/0x130
[ 56.716369] [<ffffffff8119e958>] free_pages_prepare+0x158/0x190
[ 56.717222] [<ffffffff8119edab>] __free_pages_ok+0x1b/0xb0
[ 56.717960] [<ffffffff8119f859>] __free_pages+0x29/0x50
[ 56.718710] [<ffffffff811dbce0>] update_and_free_page+0xd0/0x110
[ 56.719575] [<ffffffff811dd663>] free_pool_huge_page+0xd3/0xf0
[ 56.720407] [<ffffffff811dd7ec>] set_max_huge_pages+0x16c/0x1c0
[ 56.721255] [<ffffffff811dd968>] __nr_hugepages_store_common+0x128/0x1a0
[ 56.722203] [<ffffffff811ddb28>] hugetlb_sysctl_handler_common+0x98/0xb0
[ 56.723147] [<ffffffff811de56e>] hugetlb_sysctl_handler+0x1e/0x20
[ 56.723962] [<ffffffff8127a103>] proc_sys_call_handler+0xa3/0xb0
[ 56.724805] [<ffffffff8127a124>] proc_sys_write+0x14/0x20
[ 56.725844] [<ffffffff8120921a>] vfs_write+0xba/0x1e0
[ 56.726792] [<ffffffff81209d8d>] SyS_write+0x4d/0xc0
[ 56.727596] [<ffffffff81742a12>] system_call_fastpath+0x16/0x1b
[ 58.894865] page:ffffea0001cf8000 count:2 mapcount:0 mapping:ffff88003d481278 index:0x1
[ 58.896112] page flags: 0x5fffff80004809(locked|uptodate|private|head)
[ 58.897148] page dumped because: VM_BUG_ON_PAGE(PageCgroupUsed(pc))
[ 58.899325] pc:ffff88007ebbe000 pc->flags:7 pc->mem_cgroup:ffff8800be59a800
[ 58.900359] ------------[ cut here ]------------
[ 58.901016] kernel BUG at /src/linux-dev/mm/memcontrol.c:2707!
[ 58.901331] invalid opcode: 0000 [#1] SMP
[ 58.901331] Modules linked in: stap_6484a34ef9f0ebb4400874c66d0905ac__1496(O) bnep bluetooth ip6t_rpfilter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 cfg80211 xt_conntrack rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ppdev microcode parport_pc serio_raw parport virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc virtio_blk virtio_net ata_generic pata_acpi floppy
[ 58.901331] CPU: 1 PID: 1918 Comm: mbind_fuzz Tainted: G B O 3.15.0-140715-1512-00017-gf1ab1502aa49 #264
[ 58.901331] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 58.901331] task: ffff8800bd763b20 ti: ffff8800bd750000 task.ti: ffff8800bd750000
[ 58.901331] RIP: 0010:[<ffffffff811fee3b>] [<ffffffff811fee3b>] commit_charge+0x28b/0x2b0
[ 58.901331] RSP: 0000:ffff8800bd753c38 EFLAGS: 00010296
[ 58.901331] RAX: 000000000000003f RBX: ffffea0001cf8000 RCX: 0000000000000000
[ 58.901331] RDX: 0000000000000001 RSI: ffff88007ec0d318 RDI: ffff88007ec0d318
[ 58.901331] RBP: ffff8800bd753c78 R08: 000000000000000a R09: 0000000000000000
[ 58.901331] R10: 0000000000000000 R11: ffff8800bd75390e R12: ffff8800be59a800
[ 58.901331] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88007ebbe000
[ 58.901331] FS: 00007f9ce6fa0740(0000) GS:ffff88007ec00000(0000) knlGS:0000000000000000
[ 58.901331] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 58.901331] CR2: 0000700004600000 CR3: 000000007c194000 CR4: 00000000000006e0
[ 58.901331] Stack:
[ 58.901331] ffff8800be59a800 ffffea0001cf8000 000002003d481290 ffffea0001cf8000
[ 58.901331] ffff88003d481278 0000000000000000 ffff88003d481290 00000000000000d0
[ 58.901331] ffff8800bd753c90 ffffffff812020fc ffffea0001cf8000 ffff8800bd753cd8
[ 58.901331] Call Trace:
[ 58.901331] [<ffffffff812020fc>] mem_cgroup_commit_charge+0x6c/0xf0
[ 58.901331] [<ffffffff81196c8c>] __add_to_page_cache_locked+0xec/0x1e0
[ 58.901331] [<ffffffff81196d91>] add_to_page_cache_locked+0x11/0x20
[ 58.901331] [<ffffffff811df425>] hugetlb_no_page+0x105/0x3b0
[ 58.901331] [<ffffffff8138f799>] ? __rb_insert_augmented+0xf9/0x1e0
[ 58.901331] [<ffffffff811e02f4>] hugetlb_fault+0x2c4/0x3c0
[ 58.901331] [<ffffffff811bd184>] ? vma_interval_tree_insert+0x84/0x90
[ 58.901331] [<ffffffff811c5d93>] __handle_mm_fault+0x303/0x340
[ 58.901331] [<ffffffff811c5e5f>] handle_mm_fault+0x8f/0x130
[ 58.901331] [<ffffffff8173d3f6>] __do_page_fault+0x176/0x520
[ 58.901331] [<ffffffff8132d993>] ? file_map_prot_check+0x63/0xd0
[ 58.901331] [<ffffffff811b46a9>] ? vm_mmap_pgoff+0x99/0xc0
[ 58.901331] [<ffffffff8173d7ac>] do_page_fault+0xc/0x10
[ 58.901331] [<ffffffff8173a122>] page_fault+0x22/0x30
[ 58.901331] Code: 13 45 19 c0 41 83 e0 02 48 c1 ea 06 83 e2 01 48 83 fa 01 41 83 d8 ff e9 30 ff ff ff 48 c7 c6 20 d0 a8 81 48 89 df e8 55 fb f9 ff <0f> 0b 48 c7 c6 f3 e2 a8 81 48 89 df e8 44 fb f9 ff 0f 0b 48 c7
[ 58.901331] RIP [<ffffffff811fee3b>] commit_charge+0x28b/0x2b0
[ 58.901331] RSP <ffff8800bd753c38>
[ 58.944251] ---[ end trace 2f1aecd49dae161f ]---

I feel that these 2 messages have the same cause (just appear differently).
__add_to_page_cache_locked() (and mem_cgroup_try_charge()) can be called
for hugetlb, while we avoid calling mem_cgroup_migrate()/mem_cgroup_uncharge()
for hugetlb. This seems to make page_cgroup of the hugepage inconsistent,
and results in the bad page bug ("page dumped because: cgroup check failed").
So maybe some more PageHuge check is necessary around the charging code.

Thanks,
Naoya Horiguchi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/