Re: [patch 13/13] mm: memcontrol: rewrite uncharge API

From: Sasha Levin
Date: Fri Jun 20 2014 - 20:35:24 EST


On 06/18/2014 04:40 PM, Johannes Weiner wrote:
> The memcg uncharging code that is involved towards the end of a page's
> lifetime - truncation, reclaim, swapout, migration - is impressively
> complicated and fragile.
>
> Because anonymous and file pages were always charged before they had
> their page->mapping established, uncharges had to happen when the page
> type could still be known from the context; as in unmap for anonymous,
> page cache removal for file and shmem pages, and swap cache truncation
> for swap pages. However, these operations happen well before the page
> is actually freed, and so a lot of synchronization is necessary:
>
> - Charging, uncharging, page migration, and charge migration all need
> to take a per-page bit spinlock as they could race with uncharging.
>
> - Swap cache truncation happens during both swap-in and swap-out, and
> possibly repeatedly before the page is actually freed. This means
> that the memcg swapout code is called from many contexts that make
> no sense and it has to figure out the direction from page state to
> make sure memory and memory+swap are always correctly charged.
>
> - On page migration, the old page might be unmapped but then reused,
> so memcg code has to prevent untimely uncharging in that case.
> Because this code - which should be a simple charge transfer - is so
> special-cased, it is not reusable for replace_page_cache().
>
> But now that charged pages always have a page->mapping, introduce
> mem_cgroup_uncharge(), which is called after the final put_page(),
> when we know for sure that nobody is looking at the page anymore.
>
> For page migration, introduce mem_cgroup_migrate(), which is called
> after the migration is successful and the new page is fully rmapped.
> Because the old page is no longer uncharged after migration, prevent
> double charges by decoupling the page's memcg association (PCG_USED
> and pc->mem_cgroup) from the page holding an actual charge. The new
> bits PCG_MEM and PCG_MEMSW represent the respective charges and are
> transferred to the new page during migration.
>
> mem_cgroup_migrate() is suitable for replace_page_cache() as well,
> which gets rid of mem_cgroup_replace_page_cache().
>
> Swap accounting is massively simplified: because the page is no longer
> uncharged as early as swap cache deletion, a new mem_cgroup_swapout()
> can transfer the page's memory+swap charge (PCG_MEMSW) to the swap
> entry before the final put_page() in page reclaim.
>
> Finally, page_cgroup changes are now protected by whatever protection
> the page itself offers: anonymous pages are charged under the page
> table lock, whereas page cache insertions, swapin, and migration hold
> the page lock. Uncharging happens under full exclusion with no
> outstanding references. Charging and uncharging also ensure that the
> page is off-LRU, which serializes against charge migration. Remove
> the very costly page_cgroup lock and set pc->flags non-atomically.
>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>

Hi Johannes,

I'm seeing the following when booting a VM, bisection pointed me to this
patch.

[ 32.830823] BUG: using __this_cpu_add() in preemptible [00000000] code: mkdir/8677
[ 32.831522] caller is __this_cpu_preempt_check+0x13/0x20
[ 32.832079] CPU: 35 PID: 8677 Comm: mkdir Not tainted 3.16.0-rc1-next-20140620-sasha-00023-g8fc12ed #700
[ 32.832898] ffffffffb27ea69d ffff8800cb91b618 ffffffffb151820b 0000000000000002
[ 32.833607] 0000000000000023 ffff8800cb91b648 ffffffffaeb4c799 ffff88006efa5b60
[ 32.834318] ffffea0007cff9c0 0000000000000001 0000000000000001 ffff8800cb91b658
[ 32.835030] Call Trace:
[ 32.835257] dump_stack (lib/dump_stack.c:52)
[ 32.835755] check_preemption_disabled (./arch/x86/include/asm/preempt.h:80 lib/smp_processor_id.c:49)
[ 32.836336] __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 32.836991] mem_cgroup_charge_statistics.isra.23 (mm/memcontrol.c:930)
[ 32.837682] commit_charge (mm/memcontrol.c:2761)
[ 32.838187] ? _raw_spin_unlock_irq (./arch/x86/include/asm/paravirt.h:819 include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199)
[ 32.838735] ? get_parent_ip (kernel/sched/core.c:2546)
[ 32.839230] mem_cgroup_commit_charge (mm/memcontrol.c:6519)
[ 32.839807] __add_to_page_cache_locked (mm/filemap.c:588 include/linux/jump_label.h:115 include/trace/events/filemap.h:50 mm/filemap.c:589)
[ 32.840479] add_to_page_cache_lru (mm/filemap.c:627)
[ 32.841048] read_cache_pages (mm/readahead.c:92)
[ 32.841560] ? v9fs_cache_session_get_key (fs/9p/cache.c:306)
[ 32.842145] ? v9fs_write_begin (fs/9p/vfs_addr.c:99)
[ 32.842694] v9fs_vfs_readpages (fs/9p/vfs_addr.c:127)
[ 32.843251] __do_page_cache_readahead (mm/readahead.c:123 mm/readahead.c:200)
[ 32.843848] ? __do_page_cache_readahead (include/linux/rcupdate.h:877 mm/readahead.c:178)
[ 32.844435] ? __const_udelay (arch/x86/lib/delay.c:126)
[ 32.844944] filemap_fault (include/linux/memcontrol.h:141 include/linux/memcontrol.h:198 mm/filemap.c:1869)
[ 32.845465] ? __rcu_read_unlock (kernel/rcu/update.c:97)
[ 32.845999] __do_fault (mm/memory.c:2705)
[ 32.846472] ? mem_cgroup_try_charge (include/linux/cgroup.h:158 mm/memcontrol.c:6467)
[ 32.847048] do_cow_fault (mm/memory.c:2936)
[ 32.847561] __handle_mm_fault (mm/memory.c:3078 mm/memory.c:3205 mm/memory.c:3322)
[ 32.848092] ? __const_udelay (arch/x86/lib/delay.c:126)
[ 32.848596] ? __rcu_read_unlock (kernel/rcu/update.c:97)
[ 32.849157] handle_mm_fault (mm/memory.c:3345)
[ 32.849665] __do_page_fault (arch/x86/mm/fault.c:1230)
[ 32.850239] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[ 32.850963] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:305)
[ 32.851442] ? sched_clock_local (kernel/sched/clock.c:214)
[ 32.852034] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[ 32.852669] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 32.853243] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2))
[ 32.853854] trace_do_page_fault (arch/x86/mm/fault.c:1313 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1314)
[ 32.854393] do_async_page_fault (arch/x86/kernel/kvm.c:264)
[ 32.854924] async_page_fault (arch/x86/kernel/entry_64.S:1322)
[ 32.855507] ? __clear_user (arch/x86/lib/usercopy_64.c:22)
[ 32.855999] ? __clear_user (arch/x86/lib/usercopy_64.c:18 arch/x86/lib/usercopy_64.c:21)
[ 32.856488] clear_user (arch/x86/lib/usercopy_64.c:54)
[ 32.856997] padzero (fs/binfmt_elf.c:122)
[ 32.857440] load_elf_binary (fs/binfmt_elf.c:909 (discriminator 1))
[ 32.857949] ? search_binary_handler (fs/exec.c:1374)
[ 32.858550] ? preempt_count_sub (kernel/sched/core.c:2602)
[ 32.859089] search_binary_handler (fs/exec.c:1375)
[ 32.859654] do_execve_common.isra.19 (fs/exec.c:1412 fs/exec.c:1508)
[ 32.860319] ? do_execve_common.isra.19 (./arch/x86/include/asm/current.h:14 fs/exec.c:1406 fs/exec.c:1508)
[ 32.860949] do_execve (fs/exec.c:1551)
[ 32.861390] SyS_execve (fs/exec.c:1602)
[ 32.861848] stub_execve (arch/x86/kernel/entry_64.S:662)


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/