Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

From: Nhat Pham
Date: Mon Oct 02 2023 - 20:26:27 EST


On Mon, Oct 2, 2023 at 5:18 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> For instance, here is one of our use cases: suppose there are two 32G
> containers. The machine is booted with hugetlb_cma=6G, and each
> container may or may not use up to 3 gigantic pages, depending on the
> workload within it. The rest is anon, cache, slab, etc. We can set the
> hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> But it is very difficult to configure memory.max to keep overall
> consumption, including anon, cache, slab, etc., fair.
>
> What we have had to resort to is constantly polling hugetlb usage and
> readjusting memory.max. A similar procedure is applied to other memory
> limits (memory.low, for example). However, this is rather cumbersome
> and buggy. Furthermore, when there is a delay in correcting the memory
> limits (e.g. when hugetlb usage changes between consecutive runs of
> the userspace agent), the system could be left in an over- or
> underprotected state.
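
For illustration, here is a rough sketch of what such a userspace agent
ends up doing. The cgroup path, the 1G huge page size, and the fixed base
limit are assumptions for the example, not our actual tooling:

/*
 * Periodically add the container's current hugetlb usage on top of the
 * limit intended for anon/cache/slab, because the memory controller
 * does not see hugetlb usage on its own.
 */
#include <stdio.h>
#include <unistd.h>

#define CGROUP_DIR	"/sys/fs/cgroup/container1"	/* hypothetical path */
#define BASE_LIMIT	(32UL << 30)			/* anon + cache + slab */

static unsigned long read_ulong(const char *path)
{
	unsigned long val = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	if (fscanf(f, "%lu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

static void write_ulong(const char *path, unsigned long val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return;
	fprintf(f, "%lu\n", val);
	fclose(f);
}

int main(void)
{
	for (;;) {
		/* memory.max = base limit + current 1G hugetlb usage */
		write_ulong(CGROUP_DIR "/memory.max",
			    BASE_LIMIT +
			    read_ulong(CGROUP_DIR "/hugetlb.1GB.current"));
		/* hugetlb usage may change before the next adjustment */
		sleep(5);
	}
	return 0;
}
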
>
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is utilized, and uncharging when the folio is freed (analogous to
> the hugetlb controller). Note that we do not charge when the folio is
> allocated to the hugetlb pool, because at this point it is not owned by
> any memcg.
>
> Some caveats to consider:
> * This feature is only available on cgroup v2.
> * There is no hugetlb pool management involved in the memory
> controller. As stated above, hugetlb folios are only charged towards
> the memory controller when they are used. Host overcommit management
> has to consider this when configuring hard limits.
> * Failure to charge towards the memcg results in SIGBUS. This could
> happen even if the hugetlb pool still has pages (but the cgroup
> limit is hit and the reclaim attempt fails).
> * When this feature is enabled, hugetlb pages contribute to memory
> reclaim protection. Tuning of the low and min limits must take
> hugetlb memory into account.
> * Hugetlb pages utilized while this option is not selected will not
> be tracked by the memory controller (even if cgroup v2 is remounted
> later on).
(special thanks to Michal Hocko, who pointed most of these out).
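
To make the first and last caveats concrete: with this patch, the
accounting is opted into by mounting (or remounting) cgroup v2 with the
new memory_hugetlb_accounting option. A minimal sketch, assuming the
hierarchy is mounted at /sys/fs/cgroup and the caller has CAP_SYS_ADMIN:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/*
	 * Remount the cgroup v2 hierarchy with hugetlb accounting enabled.
	 * Hugetlb pages faulted in before this point remain untracked by
	 * the memory controller (see the last caveat above).
	 */
	if (mount(NULL, "/sys/fs/cgroup", NULL, MS_REMOUNT,
		  "memory_hugetlb_accounting"))
		perror("mount");
	return 0;
}
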
>
> Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 29 ++++++++++++++++++++
> include/linux/cgroup-defs.h | 5 ++++
> include/linux/memcontrol.h | 9 +++++++
> kernel/cgroup/cgroup.c | 15 ++++++++++-
> mm/hugetlb.c | 35 ++++++++++++++++++++-----
> mm/memcontrol.c | 35 +++++++++++++++++++++++++
> 6 files changed, 120 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 622a7f28db1f..606b2e0eac4b 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -210,6 +210,35 @@ cgroup v2 currently supports the following mount options.
> relying on the original semantics (e.g. specifying bogusly
> high 'bypass' protection values at higher tree levels).
>
> + memory_hugetlb_accounting
> + Count HugeTLB memory usage towards the cgroup's overall
> + memory usage for the memory controller (for the purpose of
> + statistics reporting and memory protection). This is a new
> + behavior that could regress existing setups, so it must be
> + explicitly opted in with this mount option.
> +
> + A few caveats to keep in mind:
> +
> + * There is no HugeTLB pool management involved in the memory
> + controller. The pre-allocated pool does not belong to anyone.
> + Specifically, when a new HugeTLB folio is allocated to
> + the pool, it is not accounted for from the perspective of the
> + memory controller. It is only charged to a cgroup when it is
> + actually used (e.g. at page fault time). Host memory
> + overcommit management has to consider this when configuring
> + hard limits. In general, HugeTLB pool management should be
> + done via other mechanisms (such as the HugeTLB controller).
> + * Failure to charge a HugeTLB folio to the memory controller
> + results in SIGBUS. This could happen even if the HugeTLB pool
> + still has pages available (but the cgroup limit is hit and
> + reclaim attempt fails).
> + * Charging HugeTLB memory towards the memory controller affects
> + memory protection and reclaim dynamics. Any userspace tuning
> + (e.g. of the low and min limits) needs to take this into account.
> + * HugeTLB pages utilized while this option is not selected
> + will not be tracked by the memory controller (even if cgroup
> + v2 is remounted later on).
> +
>
> Organizing Processes and Threads
> --------------------------------
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index f1b3151ac30b..8641f4320c98 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -115,6 +115,11 @@ enum {
> * Enable recursive subtree protection
> */
> CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
> +
> + /*
> + * Enable hugetlb accounting for the memory controller.
> + */
> + CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
> };
>
> /* cftype->flags */
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 42bf7e9b1a2f..a827e2129790 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -679,6 +679,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> return __mem_cgroup_charge(folio, mm, gfp);
> }
>
> +int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> + long nr_pages);
> +
> int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> gfp_t gfp, swp_entry_t entry);
> void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> @@ -1262,6 +1265,12 @@ static inline int mem_cgroup_charge(struct folio *folio,
> return 0;
> }
>
> +static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> + gfp_t gfp, long nr_pages)
> +{
> + return 0;
> +}
> +
> static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> {
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1fb7f562289d..f11488b18ceb 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1902,6 +1902,7 @@ enum cgroup2_param {
> Opt_favordynmods,
> Opt_memory_localevents,
> Opt_memory_recursiveprot,
> + Opt_memory_hugetlb_accounting,
> nr__cgroup2_params
> };
>
> @@ -1910,6 +1911,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
> fsparam_flag("favordynmods", Opt_favordynmods),
> fsparam_flag("memory_localevents", Opt_memory_localevents),
> fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot),
> + fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
> {}
> };
>
> @@ -1936,6 +1938,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
> case Opt_memory_recursiveprot:
> ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
> return 0;
> + case Opt_memory_hugetlb_accounting:
> + ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
> + return 0;
> }
> return -EINVAL;
> }
> @@ -1960,6 +1965,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
> cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
> else
> cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
> +
> + if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> + cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
> + else
> + cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
> }
> }
>
> @@ -1973,6 +1983,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
> seq_puts(seq, ",memory_localevents");
> if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
> seq_puts(seq, ",memory_recursiveprot");
> + if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
> + seq_puts(seq, ",memory_hugetlb_accounting");
> return 0;
> }
>
> @@ -7050,7 +7062,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
> "nsdelegate\n"
> "favordynmods\n"
> "memory_localevents\n"
> - "memory_recursiveprot\n");
> + "memory_recursiveprot\n"
> + "memory_hugetlb_accounting\n");
> }
> static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index de220e3ff8be..74472e911b0a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
> pages_per_huge_page(h), folio);
> hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> pages_per_huge_page(h), folio);
> + mem_cgroup_uncharge(folio);
> if (restore_reserve)
> h->resv_huge_pages++;
>
> @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> struct hugepage_subpool *spool = subpool_vma(vma);
> struct hstate *h = hstate_vma(vma);
> struct folio *folio;
> - long map_chg, map_commit;
> + long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
> long gbl_chg;
> - int ret, idx;
> + int memcg_charge_ret, ret, idx;
> struct hugetlb_cgroup *h_cg = NULL;
> + struct mem_cgroup *memcg;
> bool deferred_reserve;
> + gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> +
> + memcg = get_mem_cgroup_from_current();
> + memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> + if (memcg_charge_ret == -ENOMEM) {
> + mem_cgroup_put(memcg);
> + return ERR_PTR(-ENOMEM);
> + }
>
> idx = hstate_index(h);
> /*
> @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> * code of zero indicates a reservation exists (no change).
> */
> map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> - if (map_chg < 0)
> + if (map_chg < 0) {
> + if (!memcg_charge_ret)
> + mem_cgroup_cancel_charge(memcg, nr_pages);
> + mem_cgroup_put(memcg);
> return ERR_PTR(-ENOMEM);
> + }
>
> /*
> * Processes that did not create the mapping will have no
> @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> */
> if (map_chg || avoid_reserve) {
> gbl_chg = hugepage_subpool_get_pages(spool, 1);
> - if (gbl_chg < 0) {
> - vma_end_reservation(h, vma, addr);
> - return ERR_PTR(-ENOSPC);
> - }
> + if (gbl_chg < 0)
> + goto out_end_reservation;
>
> /*
> * Even though there was no reservation in the region/reserve
> @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> pages_per_huge_page(h), folio);
> }
> +
> + if (!memcg_charge_ret)
> + mem_cgroup_commit_charge(folio, memcg);
> + mem_cgroup_put(memcg);
> +
> return folio;
>
> out_uncharge_cgroup:
> @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> out_subpool_put:
> if (map_chg || avoid_reserve)
> hugepage_subpool_put_pages(spool, 1);
> +out_end_reservation:
> vma_end_reservation(h, vma, addr);
> + if (!memcg_charge_ret)
> + mem_cgroup_cancel_charge(memcg, nr_pages);
> + mem_cgroup_put(memcg);
> return ERR_PTR(-ENOSPC);
> }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0219befeae38..6660684f6f97 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7085,6 +7085,41 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
> return ret;
> }
>
> +/**
> + * mem_cgroup_hugetlb_try_charge - try to charge the memcg for a hugetlb folio
> + * @memcg: memcg to charge.
> + * @gfp: reclaim mode.
> + * @nr_pages: number of pages to charge.
> + *
> + * This function is called when allocating a huge page folio to determine if
> + * the memcg has the capacity for it. It does not commit the charge yet,
> + * as the hugetlb folio itself has not been obtained from the hugetlb pool.
> + *
> + * Once we have obtained the hugetlb folio, we can call
> + * mem_cgroup_commit_charge() to commit the charge. If we fail to obtain the
> + * folio, we should instead call mem_cgroup_cancel_charge() to undo the effect
> + * of try_charge().
> + *
> + * Returns 0 on success. Otherwise, an error code is returned.
> + */
> +int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> + long nr_pages)
> +{
> + /*
> + * If hugetlb memcg charging is not enabled, do not fail hugetlb allocation,
> + * but do not attempt to commit charge later (or cancel on error) either.
> + */
> + if (mem_cgroup_disabled() || !memcg ||
> + !cgroup_subsys_on_dfl(memory_cgrp_subsys) ||
> + !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
> + return -EOPNOTSUPP;
> +
> + if (try_charge(memcg, gfp, nr_pages))
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> /**
> * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
> * @folio: folio to charge.
> --
> 2.34.1