Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

From: Johannes Weiner
Date: Tue Oct 03 2023 - 14:39:35 EST


On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote:
> On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> >
> > On 10/02/23 17:18, Nhat Pham wrote:
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index de220e3ff8be..74472e911b0a 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
> > >  				     pages_per_huge_page(h), folio);
> > >  	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > >  					  pages_per_huge_page(h), folio);
> > > +	mem_cgroup_uncharge(folio);
> > >  	if (restore_reserve)
> > >  		h->resv_huge_pages++;
> > > 
> > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  	struct hugepage_subpool *spool = subpool_vma(vma);
> > >  	struct hstate *h = hstate_vma(vma);
> > >  	struct folio *folio;
> > > -	long map_chg, map_commit;
> > > +	long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
> > >  	long gbl_chg;
> > > -	int ret, idx;
> > > +	int memcg_charge_ret, ret, idx;
> > >  	struct hugetlb_cgroup *h_cg = NULL;
> > > +	struct mem_cgroup *memcg;
> > >  	bool deferred_reserve;
> > > +	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> > > +
> > > +	memcg = get_mem_cgroup_from_current();
> > > +	memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> > > +	if (memcg_charge_ret == -ENOMEM) {
> > > +		mem_cgroup_put(memcg);
> > > +		return ERR_PTR(-ENOMEM);
> > > +	}
> > > 
> > >  	idx = hstate_index(h);
> > >  	/*
> > > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  	 * code of zero indicates a reservation exists (no change).
> > >  	 */
> > >  	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> > > -	if (map_chg < 0)
> > > +	if (map_chg < 0) {
> > > +		if (!memcg_charge_ret)
> > > +			mem_cgroup_cancel_charge(memcg, nr_pages);
> > > +		mem_cgroup_put(memcg);
> > >  		return ERR_PTR(-ENOMEM);
> > > +	}
> > > 
> > >  	/*
> > >  	 * Processes that did not create the mapping will have no
> > > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  	 */
> > >  	if (map_chg || avoid_reserve) {
> > >  		gbl_chg = hugepage_subpool_get_pages(spool, 1);
> > > -		if (gbl_chg < 0) {
> > > -			vma_end_reservation(h, vma, addr);
> > > -			return ERR_PTR(-ENOSPC);
> > > -		}
> > > +		if (gbl_chg < 0)
> > > +			goto out_end_reservation;
> > > 
> > >  		/*
> > >  		 * Even though there was no reservation in the region/reserve
> > > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  		hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > >  						   pages_per_huge_page(h), folio);
> > >  	}
> > > +
> > > +	if (!memcg_charge_ret)
> > > +		mem_cgroup_commit_charge(folio, memcg);
> > > +	mem_cgroup_put(memcg);
> > > +
> > >  	return folio;
> > > 
> > >  out_uncharge_cgroup:
> > > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  out_subpool_put:
> > >  	if (map_chg || avoid_reserve)
> > >  		hugepage_subpool_put_pages(spool, 1);
> > > +out_end_reservation:
> > >  	vma_end_reservation(h, vma, addr);
> > > +	if (!memcg_charge_ret)
> > > +		mem_cgroup_cancel_charge(memcg, nr_pages);
> > > +	mem_cgroup_put(memcg);
> > >  	return ERR_PTR(-ENOSPC);
> > >  }
> > >
> >
> > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> > free_huge_folio. During migration, huge pages are allocated via
> > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no
> > charging for the migration target page and we uncharge the source page.
> > It looks like there will be no charge for the huge page after migration?
> >
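
For context, the path Mike is describing looks roughly like this
(condensed from mm/migrate.c and mm/hugetlb.c, details elided):

	migrate_pages()
	  unmap_and_move_huge_page()
	    alloc_migration_target()
	      alloc_hugetlb_folio_nodemask()
	        alloc_migrate_hugetlb_folio()	/* target: never charged */
	    move_hugetlb_state(src, dst, reason)
	  ...
	folio_put(src) -> free_huge_folio()	/* source: uncharged with this patch */

So after a successful migration the new hugetlb folio would indeed end
up uncharged.
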
>
> Ah I see! This is a bit subtle indeed.
>
> For the hugetlb controller, it looks like the cgroup info is updated
> inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
> to transfer the hugetlb cgroup info to the destination folio.
>
> Perhaps we can do something analogous here.
>
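
For reference, hugetlb_cgroup_migrate() is charge-neutral in exactly
that sense: it only moves the cgroup pointers from the old folio to the
new one and never touches the counters. Condensed from
mm/hugetlb_cgroup.c:

	spin_lock_irq(&hugetlb_lock);
	h_cg = hugetlb_cgroup_from_folio(old_folio);
	h_cg_rsvd = hugetlb_cgroup_from_folio_rsvd(old_folio);
	set_hugetlb_cgroup(old_folio, NULL);
	set_hugetlb_cgroup_rsvd(old_folio, NULL);

	/* move the h_cg details to new cgroup */
	set_hugetlb_cgroup(new_folio, h_cg);
	set_hugetlb_cgroup_rsvd(new_folio, h_cg_rsvd);
	list_move(&new_folio->lru, &h->hugepage_activelist);
	spin_unlock_irq(&hugetlb_lock);
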
> > If my analysis above is correct, then we may need to be careful about
> > this accounting. We may not want both source and target pages to be
> > charged at the same time.
>
> We can create a variant of mem_cgroup_migrate() that does not double
> charge, but instead just copies the mem_cgroup information to the new
> folio and then clears that info from the old folio. That way the memory
> usage counters are untouched.
>
> Somebody with more expertise on migration should fact check me
> of course :)

The only reason mem_cgroup_migrate() double charges right now is that
it's used by replace_page_cache_folio(). In that context, the
isolation of the old page isn't quite as thorough as with migration,
so it cannot transfer and uncharge directly. This goes back a long
time, to commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API").

If you rename the current implementation to mem_cgroup_replace_page()
for that one caller, you can add a mem_cgroup_migrate() variant which
is charge neutral and clears old->memcg_data. This can be used for
regular and hugetlb page migration. Something like this (totally
untested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4d3282493b6..17ec45bf3653 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 	if (mem_cgroup_disabled())
 		return;
 
-	/* Page cache replacement: new folio already charged? */
-	if (folio_memcg(new))
-		return;
-
 	memcg = folio_memcg(old);
 	VM_WARN_ON_ONCE_FOLIO(!memcg, old);
 	if (!memcg)
 		return;
 
-	/* Force-charge the new page. The old one will be freed soon */
-	if (!mem_cgroup_is_root(memcg)) {
-		page_counter_charge(&memcg->memory, nr_pages);
-		if (do_memsw_account())
-			page_counter_charge(&memcg->memsw, nr_pages);
-	}
-
-	css_get(&memcg->css);
+	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
-
-	local_irq_save(flags);
-	mem_cgroup_charge_statistics(memcg, nr_pages);
-	memcg_check_events(memcg, folio_nid(new));
-	local_irq_restore(flags);
+	old->memcg_data = 0;
 }
 
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
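
The renamed helper for replace_page_cache_folio() would then simply
keep the current force-charge behavior, something like this (equally
untested):

void mem_cgroup_replace_page(struct folio *old, struct folio *new)
{
	struct mem_cgroup *memcg;
	long nr_pages = folio_nr_pages(new);
	unsigned long flags;

	VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
	VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
	VM_BUG_ON_FOLIO(folio_nr_pages(old) != nr_pages, new);

	if (mem_cgroup_disabled())
		return;

	/* Page cache replacement: new folio already charged? */
	if (folio_memcg(new))
		return;

	memcg = folio_memcg(old);
	VM_WARN_ON_ONCE_FOLIO(!memcg, old);
	if (!memcg)
		return;

	/* Force-charge the new page. The old one will be freed soon */
	if (!mem_cgroup_is_root(memcg)) {
		page_counter_charge(&memcg->memory, nr_pages);
		if (do_memsw_account())
			page_counter_charge(&memcg->memsw, nr_pages);
	}

	css_get(&memcg->css);
	commit_charge(new, memcg);

	local_irq_save(flags);
	mem_cgroup_charge_statistics(memcg, nr_pages);
	memcg_check_events(memcg, folio_nid(new));
	local_irq_restore(flags);
}

with replace_page_cache_folio() in mm/filemap.c switched over to call
it.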