Re: [RFC PATCH 0/2] memory.low,min reclaim

From: Roman Gushchin
Date: Tue Apr 24 2018 - 06:10:11 EST


On Tue, Apr 24, 2018 at 12:56:09AM +0000, Greg Thelen wrote:
> On Mon, Apr 23, 2018 at 3:38 AM Roman Gushchin <guro@xxxxxx> wrote:
>
> > Hi, Greg!
>
> > On Sun, Apr 22, 2018 at 01:26:10PM -0700, Greg Thelen wrote:
> > > Roman's previously posted memory.low,min patches add per memcg effective
> > > low limit to detect overcommitment of parental limits. But if we flip
> > > low,min reclaim to bail if usage<{low,min} at any level, then we don't
> > > need an effective low limit, which makes the code simpler. When parent
> > > limits are overcommited memory.min will oom kill, which is more drastic but
> > > makes the memory.low a simpler concept. If memcg a/b wants oom kill before
> > > reclaim, then give it to them. It seems a bit strange for a/b/memory.low's
> > > behaviour to depend on a/c/memory.low (i.e. a/b.low is strong unless
> > > a/b.low+a/c.low exceed a.low).
>
> > It's actually not strange: a/b and a/c are sharing a common resource:
> > a/memory.low.
>
> > Exactly as a/b/memory.max and a/c/memory.max are sharing a/memory.max.
> > If there are sibling cgroups which are consuming memory, a cgroup can't
> > exceed parent's memory.max, even if its memory.max is grater.
>
> > >
> > > I think there might be a simpler way (ableit it doesn't yet include
> > > Documentation):
> > > - memcg: fix memory.low
> > > - memcg: add memory.min
> > > 3 files changed, 75 insertions(+), 6 deletions(-)
> > >
> > > The idea of this alternate approach is for memory.low,min to avoid
> reclaim
> > > if any portion of under-consideration memcg ancestry is under respective
> > > limit.
>
> > This approach has a significant downside: it breaks hierarchical
> constraints
> > for memory.low/min. There are two important outcomes:
>
> > 1) Any leaf's memory.low/min value is respected, even if parent's value
> > is lower or even 0. It's not possible anymore to limit the amount
> of
> > protected memory for a sub-tree.
> > This is especially bad in case of delegation.
>
> As someone who has been using something like memory.min for a while, I have
> cases where it needs to be a strong protection. Such jobs prefer oom kill
> to reclaim. These jobs know they need X MB of memory. But I guess it's on
> me to avoid configuring machines which overcommit memory.min at such cgroup
> levels all the way to the root.

Absolutely.

>
> > 2) If a cgroup has an ancestor with the usage under its memory.low/min,
> > it becomes protection, even if its memory.low/min is 0. So it
> becomes
> > impossible to have unprotected cgroups in protected sub-tree.
>
> Fair point.
>
> One use case is where a non trivial job which has several memory accounting
> subcontainers. Is there a way to only set memory.low at the top and have
> the offer protection to the job?
> The case I'm thinking of is:
> % cd /cgroup
> % echo +memory > cgroup.subtree_control
> % mkdir top
> % echo +memory > top/cgroup.subtree_control
> % mkdir top/part1 top/part2
> % echo 1GB > top/memory.min
> % (echo $BASHPID > top/part1/cgroup.procs && part1)
> % (echo $BASHPID > top/part2/cgroup.procs && part2)
>
> Empirically it's been measured that the entire workload (/top) needs 1GB to
> perform well. But we don't care how the memory is distributed between
> part1,part2. Is the strategy for that to set /top, /top/part1.min, and
> /top/part2.min to 1GB?

The problem is that right now we don't have an "undefined" value for
memory.min/low. The default value is 0, which means "no protection".
So there is no way how a user can express "whatever parent cgroup wants".
It might be useful to introduce such value, as other controllers
may benefit too. But it's a separate theme to discuss.

In your example, it's possible to achieve the requested behavior by setting
top.min into 1G and part1.min and part2.min into "max".

>
> What do you think about exposing emin and elow to user space? I think that
> would reduce admin/user confusion in situations where memory.min is
> internally discounted.

They might be useful in some cases (e.g. a cgroup want's to know how much
actual protection it can get), but at the same time these values are
intentionally racy and don't have a clear semantics.
So, maybe we can show them in memory.stat, but I doubt that they deserve
a separate interface file.

>
> (tangent) Delegation in v2 isn't something I've been able to fully
> internalize yet.
> The "no interior processes" rule challenges my notion of subdelegation.
> My current model is where a system controller creates a container C with
> C.min and then starts client manager process M in C. Then M can choose
> to further divide C's resources (e.g. C/S). This doesn't seem possible
> because v2 doesn't allow for interior processes. So the system manager
> would need to create C, set C.low, create C/sub_manager, create
> C/sub_resources, set C/sub_manager.low, set C/sub_resources.low, then start
> M in C/sub_manager. Then sub_manager can create and manage
> C/sub_resources/S.

And this is a good example of a case, when some cgroups in the tree
should be protected to work properly (for example, C/sub_manager/memory.low = 128M),
while an actual workload might be not (C/sub_resources/memory.low = 0).

Thanks!