Re: [PATCH 0/2] hugetlb memcg accounting

From: Nhat Pham
Date: Tue Sep 26 2023 - 20:15:06 EST


On Tue, Sep 26, 2023 at 1:50 PM Frank van der Linden <fvdl@xxxxxxxxxx> wrote:
>
> On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > This patch series rectifies this issue by charging the memcg when the
> > hugetlb folio is allocated, and uncharging when the folio is freed. In
> > addition, a new selftest is added to demonstrate and verify this new
> > behavior.
> >
> > Nhat Pham (2):
> >   hugetlb: memcg: account hugetlb-backed memory in memory controller
> >   selftests: add a selftest to verify hugetlb usage in memcg
> >
> >  MAINTAINERS                                   |   2 +
> >  fs/hugetlbfs/inode.c                          |   2 +-
> >  include/linux/hugetlb.h                       |   6 +-
> >  include/linux/memcontrol.h                    |   8 +
> >  mm/hugetlb.c                                  |  23 +-
> >  mm/memcontrol.c                               |  40 ++++
> >  tools/testing/selftests/cgroup/.gitignore     |   1 +
> >  tools/testing/selftests/cgroup/Makefile       |   2 +
> >  .../selftests/cgroup/test_hugetlb_memcg.c     | 222 ++++++++++++++++++
> >  9 files changed, 297 insertions(+), 9 deletions(-)
> >  create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
> >
> > --
> > 2.34.1
> >
>
> We've had this behavior at Google for a long time, and we're actually
> getting rid of it. hugetlb pages are a precious resource that should
> be accounted for separately. They are not just any memory; they are
> physically contiguous memory, and charging them the same as any other
> region of the same size ended up not making sense, especially not for
> larger hugetlb page sizes.

I agree hugetlb is a special kind of resource. But as Johannes
pointed out, it is still a form of memory. Semantically, its usage
should be modulated by the memory controller.

We do have the HugeTLB controller for hugetlb-specific
restrictions, and where appropriate we definitely should take
advantage of it. But it does not fix the hole we have in memory
usage reporting, nor the (over)protection and reclaim dynamics
that follow from it.
Hence the need for the userspace hack (as Johannes described):
manually adding/subtracting HugeTLB usage where applicable.
This is not only inelegant, but also cumbersome and buggy.
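
A minimal sketch of that kind of userspace bookkeeping, for
illustration only. It assumes cgroup v2 with the hugetlb controller
enabled, so that a cgroup directory exposes memory.current and
hugetlb.<size>.current, both in bytes. The helper names and the
example path are hypothetical and are not taken from any production
tooling.

```python
import glob
import os


def read_bytes(path):
    """Read a single integer byte count from a cgroup interface file."""
    with open(path) as f:
        return int(f.read().strip())


def total_usage_bytes(cgroup_dir):
    """Return memory.current plus every hugetlb.<size>.current.

    Assumption: hugetlb usage is NOT already charged to the memory
    controller. If it were, this sum would double count.
    """
    memcg = read_bytes(os.path.join(cgroup_dir, "memory.current"))

    hugetlb = 0
    for path in glob.glob(os.path.join(cgroup_dir, "hugetlb.*.current")):
        # Skip reservation counters such as hugetlb.2MB.rsvd.current;
        # only the per-size usage files are wanted here.
        if ".rsvd." in os.path.basename(path):
            continue
        hugetlb += read_bytes(path)

    return memcg + hugetlb
```

A caller would use something like
total_usage_bytes("/sys/fs/cgroup/workload") (hypothetical path) in
place of reading memory.current directly, and every consumer of the
usage number has to remember to do the same.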

>
> Additionally, if this behavior is changed just like that, there will
> be quite a few workloads that will break badly because they'll hit
> their limits immediately - imagine a container that uses 1G hugetlb
> pages to back something large (a database, a VM), and 'plain' memory
> for control processes.
>
> What do your workloads do? Is it not possible for you to account for
> hugetlb pages separately? Sure, it can be annoying to have to deal
> with 2 separate totals that you need to take into account, but again,
> hugetlb pages are a resource that is best dealt with separately.
>

Johannes beat me to it - he described our use case, and what we
have hacked together to temporarily get around the issue.

A knob/flag to turn on/off this behavior sounds good to me.

> - Frank

Thanks for the comments, Frank!