Re: [PATCH v2 1/2] hugetlb: memcg: account hugetlb-backed memory in memory controller

From: Michal Hocko
Date: Mon Oct 02 2023 - 09:43:25 EST


On Wed 27-09-23 17:57:22, Nhat Pham wrote:
> Currently, hugetlb memory usage is not acounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is allocated, and uncharging when the folio is freed (analogous to
> the hugetlb controller).

This changelog is missing a lot of information. Both about the usecase
(we do not want to fish that out from archives in the future) and the
actual implementation and the reasoning behind that.

AFAICS you have decided to charge on the hugetlb use rather than hugetlb
allocation to the pool. I suspect the underlying reasoning is that pool
pages do not belong to anybody. This is a deliberate decision and it
should be documented as such.

It is also very important do describe subtle behavior properties that
might be rather unintuitive to users. Most notably
- there is no hugetlb pool management involved in the memcg
controller. One has to use hugetlb controller for that purpose.
Also the pre allocated pool as such doesn't belong to anybody so the
memcg host overcommit management has to consider it when configuring
hard limits.
- memcg limit reclaim doesn't assist hugetlb pages allocation when
hugetlb overcommit is configured (i.e. pages are not consumed from the
pool) which means that the page allocation might disrupt workloads
from other memcgs.
- failure to charge a hugetlb page results in SIGBUS rather
than memcg oom killer. That could be the case even if the
hugetlb pool still has pages available and there is
reclaimable memory in the memcg.
- hugetlb pages are contributing to memory reclaim protection
implicitly. This means that the low,min limits tunning has to consider
hugetlb memory as well.

I suspect there is more than the above. To be completely honest I am
still not convinced this is a good idea.

I do recognize that this might work in a very limited environments but
hugetlb management is quite challenging on its own and this just adds
another layer of complexity which is really hard to see through without
an intimate understanding of both memcg and hugetlb. The reason that
hugetlb has been living outside of the core MM (and memcg) is not just
because we like it that way. And yes I do fully understand that users
shouldn't really care about that because this is just a memory to them.

We should also consider the global control for this functionality. I am
especially worried about setups where a mixed bag of workloads
(containers) is executed. While some of them will be ready for the new
accounting mode many will leave in their own world without ever being
modified. How do we deal with that situation?

All that being said, I am not going to ack nor nack this but I really do
prefer to be much more explicit about the motivation and current
implementation specifics so that we can forward users to something
they can digest.

> Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
[...]

a minor implementation detail below. I couldn't spot anything obviously
broken with the rest of the hugetlb specific code. restore_reserve_on_memcg_failure
is rather clumsy and potentially error prone but I will leave that out
to Mike as he is much more familiar with that behavior than me.

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index de220e3ff8be..ff88ea4df11a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
[...]
> @@ -3119,6 +3121,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> pages_per_huge_page(h), folio);
> }
> +
> + /* undo allocation if memory controller disallows it. */
> + if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) {

htlb_alloc_mask(h) rather than GFP_KERNEL. Ideally with
__GFP_RETRY_MAYFAIL which is a default allocation policy.

> + if (restore_reserve_on_memcg_failure)
> + restore_reserve_on_error(h, vma, addr, folio);
> + folio_put(folio);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> return folio;
>
> out_uncharge_cgroup:

--
Michal Hocko
SUSE Labs