Re: [QUESTION] memcg page_counter seems broken in MADV_DONTNEED with THP enabled

From: Yang Shi
Date: Mon Nov 28 2022 - 15:01:53 EST


On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <liuyongqiang13@xxxxxxxxxx> wrote:
>
> Hi,
>
> We use mm_counter to track how much physical memory a process uses.
> Meanwhile, the page_counter of a memcg tracks how much physical memory
> a cgroup uses.
> If a cgroup contains only one process, the two look almost the same. But
> with THP enabled, memory.usage_in_bytes in the memcg may sometimes be
> twice the rss in /proc/[pid]/smaps_rollup or more, as follows:
>
> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
> 1080930304
> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
> 1290
> [root@localhost sda]# cat /proc/1290/smaps_rollup
> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0 [rollup]
> Rss: 500648 kB
> Pss: 498337 kB
> Shared_Clean: 2732 kB
> Shared_Dirty: 0 kB
> Private_Clean: 364 kB
> Private_Dirty: 497552 kB
> Referenced: 500648 kB
> Anonymous: 492016 kB
> LazyFree: 0 kB
> AnonHugePages: 129024 kB
> ShmemPmdMapped: 0 kB
> Shared_Hugetlb: 0 kB
> Private_Hugetlb: 0 kB
> Swap: 0 kB
> SwapPss: 0 kB
> Locked: 0 kB
> THPeligible: 0
>
> I found that the difference arises because __split_huge_pmd decreases
> the mm_counter, but the page_counter in the memcg is not decreased while
> the refcount of the head page is still nonzero. The call path is as follows:
>
> do_madvise
>   madvise_dontneed_free
>     zap_page_range
>       unmap_single_vma
>         zap_pud_range
>           zap_pmd_range
>             __split_huge_pmd
>               __split_huge_pmd_locked
>                 __mod_lruvec_page_state
>             zap_pte_range
>               add_mm_rss_vec
>                 add_mm_counter -> decrease the mm_counter
>       tlb_finish_mmu
>         arch_tlb_finish_mmu
>           tlb_flush_mmu_free
>             free_pages_and_swap_cache
>               release_pages
>                 folio_put_testzero(page) -> not zero, skip
>                   continue;
>                 __folio_put_large
>                   free_transhuge_page
>                     free_compound_page
>                       mem_cgroup_uncharge
>                         page_counter_uncharge -> decrease the page_counter
>
> node_page_stat, which is shown in meminfo, was also decreased.
> __split_huge_pmd seems to free no physical memory unless the whole THP
> is freed. I am confused about which one reflects the true physical
> memory usage of a process.

This should be caused by the deferred split of THP. When MADV_DONTNEED
is called on only part of a THP mapping, the huge PMD is split, but the
THP itself is not split until memory pressure is hit (global or memcg
limit). So the unmapped subpages are not actually freed until that
point. The mm counter is decreased by the zapping, but the physical
pages are not freed, and therefore not uncharged from the memcg, until
then.
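
To make the accounting gap concrete, here is a toy userspace model of
the two counters. It is only a sketch, not kernel code: struct toy_thp,
struct toy_counters and the toy_* functions are hypothetical names made
up for illustration, and 512 subpages per THP assumes 2 MiB huge pages
with 4 KiB base pages (x86-64).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 2 MiB THP built from 4 KiB base pages. */
#define SUBPAGES_PER_THP 512

/* Toy stand-ins only; these are not kernel structures. */
struct toy_thp {
    bool mapped[SUBPAGES_PER_THP]; /* is subpage i still mapped? */
    bool compound;                 /* true until the THP itself is split or freed */
};

struct toy_counters {
    long rss_pages;     /* stands in for the mm counter (process RSS) */
    long charged_pages; /* stands in for the memcg page_counter */
};

size_t toy_mapped_count(const struct toy_thp *thp)
{
    size_t n = 0;
    for (size_t i = 0; i < SUBPAGES_PER_THP; i++)
        n += thp->mapped[i] ? 1 : 0;
    return n;
}

/* Fault in one THP: both counters grow by the full huge page. */
void toy_fault_in(struct toy_thp *thp, struct toy_counters *c)
{
    for (size_t i = 0; i < SUBPAGES_PER_THP; i++)
        thp->mapped[i] = true;
    thp->compound = true;
    c->rss_pages += SUBPAGES_PER_THP;
    c->charged_pages += SUBPAGES_PER_THP;
}

/* Model MADV_DONTNEED on subpages [start, start + n). */
void toy_dontneed(struct toy_thp *thp, struct toy_counters *c,
                  size_t start, size_t n)
{
    assert(start <= SUBPAGES_PER_THP && n <= SUBPAGES_PER_THP - start);
    for (size_t i = start; i < start + n; i++) {
        if (!thp->mapped[i])
            continue;
        thp->mapped[i] = false;
        c->rss_pages--;          /* zapping drops RSS immediately */
        if (!thp->compound)
            c->charged_pages--;  /* already split: subpage is freed now */
    }
    /* Whole THP unmapped while still compound: freed and uncharged at once. */
    if (thp->compound && toy_mapped_count(thp) == 0) {
        c->charged_pages -= SUBPAGES_PER_THP;
        thp->compound = false;
    }
}

/* Model the deferred split running under memory pressure. */
void toy_reclaim(struct toy_thp *thp, struct toy_counters *c)
{
    if (!thp->compound)
        return;
    /* Splitting the THP lets the unmapped subpages be freed and uncharged. */
    c->charged_pages -= (long)(SUBPAGES_PER_THP - toy_mapped_count(thp));
    thp->compound = false;
}
```

In this model, calling toy_fault_in() and then toy_dontneed() on the
first 256 subpages leaves rss_pages at 256 while charged_pages stays at
512, which is the kind of gap seen between smaps_rollup and
memory.usage_in_bytes. Only after toy_reclaim() does charged_pages drop
to 256.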

>
>
> Kind regards,
>
> Yongqiang Liu
>
>