Re: [RFC PATCH 0/8] memory recharging for offline memcgs

From: Yosry Ahmed
Date: Thu Jul 20 2023 - 18:24:46 EST


On Thu, Jul 20, 2023 at 3:12 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Thu, Jul 20, 2023 at 02:34:16PM -0700, Yosry Ahmed wrote:
> > > Or just create a nesting layer so that there's a cgroup which represents the
> > > persistent resources and a nested cgroup instance inside representing the
> > > current instance.
> >
> > In practice it is not easy to know exactly which resources are shared
> > and used by which cgroups, especially in a large dynamic environment.
>
> Yeah, that only covers cases where resource persistence is confined to a
> known scope. That said, I have a hard time seeing how recharging once after
> cgroup destruction can be a solution for the situations you describe. What
> if A touches it once first, B constantly uses it, but C touches it only
> very occasionally, and after A dies C ends up owning it due to timing? This
> is very much possible in a large dynamic environment, but neither the
> initial nor the final situation is satisfactory.

That is indeed possible, but it is more likely that the charge moves
to B. As I said, it's not perfect, but it is an improvement over what
we have today. Even if C ends up owning it, that is better than the
charge staying with the dead A.
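
To make the timing dependence concrete, here is a toy userspace
simulation (plain C, not kernel code; every name in it is made up for
illustration) of a recharge-on-next-touch rule: after A dies,
whichever memcg touches the page first ends up owning it, regardless
of who uses it most.

#include <stdio.h>

struct toy_page {
	const char *memcg;	/* who the page is charged to */
	int owner_dead;		/* has the charging memcg gone offline? */
};

/* hypothetical rule: a touch recharges the page iff its owner is dead */
static void touch(struct toy_page *p, const char *who)
{
	if (p->owner_dead) {
		p->memcg = who;
		p->owner_dead = 0;
	}
}

int main(void)
{
	struct toy_page p = { .memcg = "A", .owner_dead = 0 };

	touch(&p, "B");		/* A alive: charge stays with A */
	p.owner_dead = 1;	/* A dies; its memcg is now a zombie */
	touch(&p, "C");		/* C happens to touch first: C owns it */
	touch(&p, "B");		/* B's heavy use no longer matters */
	printf("page charged to %s\n", p.memcg);	/* prints C */
	return 0;
}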

>
> To solve the problems you're describing, you would actually have to
> guarantee that memory pages are charged to the current majority user (or
> maybe even spread across the current active users). Maybe it can be argued
> that this is a step towards that, but it's a very partial step, and it
> would at least need a technically viable direction that this development
> can follow.

Right, that would be a much larger effort (arguably memcg v3 ;) ).
This proposal is focused on the most painful artifact of the
sharing/sticky resources problem: zombie memcgs. We can extend the
automatic charge movement semantics later to cover more cases or make
them smarter, or ditch the existing charging semantics completely and
start over with sharing/stickiness in mind. Either way, that would be
a long-term effort. There is a problem that exists today, though, that
this proposal can ideally fix, or at least improve.

>
> On its own, AFAICS, I'm not sure the scope of problems it can actually solve
> is justifiably greater than what can be achieved with simple nesting.

In our use case, nesting is not a viable option. As I said, in a large
fleet where many different workloads are dynamically scheduled on
different machines, there is no way of knowing which resources are
shared among which workloads, and even if we did know, the sharing
would not be constant. That makes it very difficult to construct a
nested hierarchy that keeps the shared resources confined.

Keep in mind that the environment is dynamic: workloads are constantly
coming and going. Even if we find the perfect nesting to appropriately
scope resources, some rescheduling may render the hierarchy obsolete
and require us to start over.
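
For reference, the nesting layer being discussed would look roughly
like the sketch below (a minimal illustration; the cgroup names and
paths are hypothetical): an outer cgroup scopes the persistent
resources and a nested child represents the current instance, so a
restart replaces only the child and the shared charges stay in a live
parent. It is exactly this static layout that is hard to maintain in
a dynamic fleet.

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	/* outer scope that outlives any single instance */
	if (mkdir("/sys/fs/cgroup/workload", 0755) && errno != EEXIST)
		perror("mkdir workload");

	/* inner cgroup for the current instance; on restart, remove
	 * it and create instance1, instance2, ... instead */
	if (mkdir("/sys/fs/cgroup/workload/instance0", 0755))
		perror("mkdir instance0");

	/* tasks would then be attached by writing their PIDs to
	 * /sys/fs/cgroup/workload/instance0/cgroup.procs */
	return 0;
}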

>
> Thanks.
>
> --
> tejun