Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

From: Johannes Weiner
Date: Tue Oct 03 2023 - 11:59:52 EST


On Tue, Oct 03, 2023 at 02:58:58PM +0200, Michal Hocko wrote:
> On Mon 02-10-23 17:18:27, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > For instance, here is one of our usecases: suppose there are two 32G
> > containers. The machine is booted with hugetlb_cma=6G, and each
> > container may or may not use up to 3 gigantic pages, depending on the
> > workload within it. The rest is anon, cache, slab, etc. We can set the
> > hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> > But it is very difficult to configure memory.max to keep overall
> > consumption, including anon, cache, slab, etc., fair.
> >
> > What we have had to resort to is to constantly poll hugetlb usage and
> > readjust memory.max. A similar procedure is applied to other memory
> > limits (e.g., memory.low). However, this is rather cumbersome and buggy.
>
> Could you expand some more on how this _helps_ memory.low? Hugetlb
> memory is not reclaimable, so whatever portion of a memcg's consumption
> it makes up is effectively "protected from the reclaim". Consider this
>        parent
>       /      \
>      A        B
>  low=50%      low=0
>  current=40%  current=60%
>
> Under external memory pressure, reclaim should prefer B, as A is under
> its low limit, correct? But now consider that the predominant consumption
> of B is hugetlb, which means memory reclaim cannot do much for B, and so
> A's protection might be breached.
>
> As an admin (or a tool) you need to know about hugetlb as a potential
> contributor to this behavior (sure, mlocked memory would behave the same,
> but mlock rarely consumes a huge amount of memory in my experience).
> Without the accounting there might not be any external pressure in the
> first place.
>
> All that being said, I do not see how adding hugetlb into the accounting
> makes managing the low and min limits any easier.

It's important to differentiate the cgroup usecases. One is of course
the cloud/virtual server scenario, where you set the hard limits to
whatever the customer paid for, and don't know and don't care about
the workload running inside. In that case, memory.low and overcommit
aren't really safe to begin with, due to unknown amounts of unreclaimable
memory.

The other common usecase is the datacenter where you run your own
applications. You understand their workingset and requirements, and
configure and overcommit the containers in a way where jobs always
meet their SLAs. E.g. if multiple containers spike, memory.low is set
such that interactive workloads are prioritized over batch jobs, and
both have priority over routine system management tasks.
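To make that concrete, the prioritization boils down to staggered
memory.low values, roughly like the following sketch (assuming cgroup v2
mounted at /sys/fs/cgroup; the group names and byte values are made up):

PROTECTIONS = {
    "interactive": 20 * 1024**3,  # protected the most under pressure
    "batch": 8 * 1024**3,         # reclaimed before interactive
    "system": 0,                  # routine management tasks: no protection
}

for group, low in PROTECTIONS.items():
    # memory.low is best-effort reclaim protection in cgroup v2
    with open(f"/sys/fs/cgroup/{group}/memory.low", "w") as f:
        f.write(str(low))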

This is arguably the only case where it's safe to use memory.low. You
have to know what's reclaimable and what isn't, otherwise you cannot
know that memory.low will even do anything, and isolation breaks down.
So we already have that knowledge: mlocked sections, how much anon has
no swap space to back it, and how much memory must not be reclaimed (even if
it is reclaimable) for the workload to meet its SLAs. Hugetlb doesn't
really complicate this equation - we already have to consider it
unreclaimable workingset from an overcommit POV on those hosts.
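Something like the following sketch captures that bookkeeping (a rough
illustration, assuming cgroup v2 and a swapless host; the cgroup path, the
1GB hugetlb page size, and the declared_ws parameter are placeholders):

def unreclaimable_floor(cgroup="/sys/fs/cgroup/workload", declared_ws=0):
    stats = {}
    with open(f"{cgroup}/memory.stat") as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)

    # Without swap, anonymous memory cannot be reclaimed at all. Mlocked
    # page cache (part of "unevictable") would be another unreclaimable
    # contributor, left out here to keep the floor conservative.
    floor = stats["anon"]

    # Hugetlb is charged to its own controller and, before this patch,
    # never shows up in memory.current or memory.stat.
    with open(f"{cgroup}/hugetlb.1GB.current") as f:
        floor += int(f.read())

    # Plus whatever the service has declared must stay resident to meet
    # its SLAs, even if it is technically reclaimable.
    return floor + declared_ws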

The reason this patch helps in this scenario is that the service teams
are usually different from the containers/infra team. The service team
understands its workload and declares its workingset. But it's the
infra team running the containers that currently has to go and find
out if they're using hugetlb and tweak the cgroups. Bugs and
untimeliness in the tweaking have caused multiple production incidents
already. And both teams are regularly confused when there are large
parts of the workload that don't show up in memory.current, which both
sides monitor. Keep in mind that these systems are already pretty
complex, with multiple overcommitted containers and system-level
activity. The current hugetlb quirk can heavily distort what a given
container is doing on the host.
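For reference, the tweaking boils down to a loop like this (a hypothetical
sketch, assuming cgroup v2, 1GB hugetlb pages, and an illustrative 32G
container budget; the path and polling interval are made up):

import time

CGROUP = "/sys/fs/cgroup/container1"
BUDGET = 32 * 1024**3  # overall memory budget for the container

def read_int(path):
    with open(path) as f:
        return int(f.read())

while True:
    # hugetlb usage is only visible to the hugetlb controller, not in
    # memory.current, so memory.max has to be shrunk by hand to keep the
    # container's total footprint within its budget.
    hugetlb = read_int(f"{CGROUP}/hugetlb.1GB.current")
    with open(f"{CGROUP}/memory.max", "w") as f:
        f.write(str(max(BUDGET - hugetlb, 0)))
    time.sleep(10)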

With this patch, the service can declare its workingset, the container
team can configure the container, and memory.current makes sense to
everybody. The workload parameters are pretty reliable, but if the
service team gets it wrong and we underprotect the workload, and/or
its unreclaimable memory exceeds what was declared, the infra team
gets alarms on elevated LOW breaching events and investigates whether it's
an infra problem or a service spec problem that needs escalation.
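The LOW breach counter lives in memory.events; a minimal sketch of reading
it, assuming cgroup v2 (the path and any alerting hook are placeholders):

def low_breach_count(cgroup="/sys/fs/cgroup/workload"):
    # The "low" counter in memory.events increments every time the group
    # is reclaimed even though its usage is below its memory.low
    # protection - the misconfiguration signal described above.
    with open(f"{cgroup}/memory.events") as f:
        for line in f:
            name, count = line.split()
            if name == "low":
                return int(count)
    return 0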

So the case you describe above only happens when mistakes are made,
and we detect and rectify them. In the common case, hugetlb is part of
the recognized workingset, and we configure memory.low to cut off only
known optional and reclaimable memory under pressure.