Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection

From: Johannes Weiner
Date: Wed Feb 26 2020 - 10:05:57 EST


Hello,

On Wed, Feb 26, 2020 at 02:22:37PM +0100, Michal Koutný wrote:
> On Tue, Feb 25, 2020 at 10:03:04AM -0500, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > Can you explain why you think protection is different from a weight?
> - weights are dimension-less, they represent no real resource

They still ultimately translate to real resources. The concrete value
depends on what the parent's weight translates to, and it depends on
sibling configurations and their current consumption. (All of this is
already true for memory protection as well, btw). But eventually, a
weight specification translates to actual time on a CPU, bandwidth on
an IO device etc.

> - sum of sibling weights is meaningless (and independent from parent
> weight)

Technically true for overcommitted memory.low values as well.

> - to me this protection is closer to limits (actually I like your simile
> that they're very lazily enforced limits)

But weights are also lazily enforced limits. Without competition, you
can get 100% regardless of your weight; under contention, you get
throttled/limited back to an assigned share, however that's specified.

Once you peel away the superficial layer of how resources, or shares
of resources, are being referred to (time, bytes, relative shares)
weights/guarantees/protections are all the same thing: they are lazily
enforced* partitioning rules of a resource under contention.

I don't see a fundamental difference between them. And that in turn
makes it hard for me to accept that hierarchical inheritance rules
should be different.

* We also refer to this as work-conserving
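
To make the work-conserving point concrete, here is a minimal sketch of
a weight-based split. It is a toy model, not kernel code: the function
name, the group names "a" and "b", and the demand/capacity figures are
all made up for illustration.

```python
def work_conserving_shares(weights, demands, capacity):
    """Split capacity by weight, never granting more than a group demands.

    Capacity left unused by a satisfied group is redistributed among the
    remaining groups, again in proportion to their weights.
    """
    alloc = {name: 0.0 for name in weights}
    active = {name for name in weights if demands[name] > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[n] for n in active)
        satisfied = set()
        distributed = 0.0
        for n in active:
            fair = remaining * weights[n] / total_weight
            grant = min(fair, demands[n] - alloc[n])
            alloc[n] += grant
            distributed += grant
            if alloc[n] >= demands[n] - 1e-9:
                satisfied.add(n)
        remaining -= distributed
        if not satisfied:
            break  # everyone is limited by weight; nothing left to hand out
        active -= satisfied
    return alloc


weights = {"a": 100, "b": 300}
# No competition: "a" gets everything regardless of its weight.
alone = work_conserving_shares(weights, {"a": 100, "b": 0}, 100)
# Contention: both are throttled back to their weighted share.
contended = work_conserving_shares(weights, {"a": 100, "b": 100}, 100)
```

In this model "a" ends up with the full 100 units when "b" is idle, and
with 25 (against 75 for "b") once both want everything. The weight only
acts as a limit when there is contention.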

> > Now you apply memory pressure, what happens?. D isn't reclaimed, C is
> > somewhat reclaimed, E is reclaimed hard. D will not page, C will page
> > a little bit, E will page hard *with the higher IO priority of B*.
> >
> > Now C is stuck behind E. This is a priority inversion.
> This is how I understand the weights to work.
>
> A
> `- B io.weight=200
>    `- D io.weight=100 (e.g.)
>    `- E io.weight=100 (e.g.)
> `- C io.weight=50
>
> Whatever weights I assign to D and E, when only E and C compete, E will
> have higher weight (200 to 50, work-conservacy of weights).

Yes, exactly. I'm saying the same should be true for memory.
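
As a rough illustration of the 200-to-50 behavior in the quoted
io.weight tree, the sketch below computes each active leaf's share by
splitting a parent's share among only those children that have active
descendants. The data layout and function names are hypothetical, and
the weight of 100 on the root A is an assumption (it has no siblings, so
it does not affect the result).

```python
TREE = {
    "A": {"weight": 100, "children": ["B", "C"]},  # root weight is assumed
    "B": {"weight": 200, "children": ["D", "E"]},
    "C": {"weight": 50, "children": []},
    "D": {"weight": 100, "children": []},
    "E": {"weight": 100, "children": []},
}


def has_active(tree, node, active):
    """True if node is an active leaf or has an active leaf below it."""
    children = tree[node]["children"]
    if not children:
        return node in active
    return any(has_active(tree, c, active) for c in children)


def hierarchical_shares(tree, root, active):
    """Return the fraction of the resource each active leaf receives."""
    shares = {}

    def walk(node, share):
        busy = [c for c in tree[node]["children"] if has_active(tree, c, active)]
        if not busy:
            if node in active:
                shares[node] = share
            return
        total = sum(tree[c]["weight"] for c in busy)
        for c in busy:
            # Idle siblings are skipped, so their share flows to busy ones.
            walk(c, share * tree[c]["weight"] / total)

    walk(root, 1.0)
    return shares


only_e_and_c = hierarchical_shares(TREE, "A", {"E", "C"})
all_leaves = hierarchical_shares(TREE, "A", {"D", "E", "C"})
```

With only E and C active, E inherits all of B's weight and the split is
0.8 to 0.2. With D active as well, D and E each get 0.4 and C still gets
0.2.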

> I don't think this inversion is wrong because E's work is still on
> behalf of B.

"Wrong" isn't the right term. Is it what you wanted to express in your
configuration?

What's the point of designating E a memory donor group that needs to
relinquish memory to C under pressure, but when it actually gives up
that memory it beats C in competition over a different resource?

You are talking about a mathematical truth on a per-controller
basis. What I'm saying is that I don't see how this is useful for real
workloads, their relative priorities, and the performance expectations
users have from these priorities.

With a priority inversion like this, there is no actual performance
isolation or containerization going on here - which is the whole point
of cgroups and resource control.

> Or did you mean that if protections were transformed (via effective
> calculation) to have ratios only in the same range as io.weights
> (1e-4..1e4 instead of 0..inf), then it'd prevent the inversion? (By
> setting D,E weights in same ratios as D,E protections.)

No, the inversion would be prevented if E could consume all resources
assigned to B that aren't consumed by D.

This is already true for IO and CPU, but without my patch it is not
true for memory.
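
The difference can be sketched with a deliberately simplified model of
how a parent's protection reaches its children. This is not the kernel's
effective protection calculation; the function name, its parameters, and
the usage figures (D using 6G, E using 8G under a parent with 10G of
protection) are assumptions chosen for illustration.

```python
def child_protection(parent_eff, children, recursive):
    """Distribute a parent's effective protection among its children.

    children maps a name to (low, usage). A child claims min(low, usage).
    With recursive=True, protection the children did not claim is handed
    out in proportion to each child's unprotected usage.
    """
    claimed = {n: min(low, usage) for n, (low, usage) in children.items()}
    total_claimed = sum(claimed.values())
    eff = {}
    for n in children:
        if total_claimed > parent_eff:
            # Overcommitted: scale every claim down proportionally.
            eff[n] = claimed[n] * parent_eff / total_claimed
        else:
            eff[n] = float(claimed[n])
    if recursive and total_claimed < parent_eff:
        unclaimed = parent_eff - total_claimed
        excess = {n: usage - claimed[n] for n, (low, usage) in children.items()}
        total_excess = sum(excess.values())
        if total_excess > 0:
            # Never protect more than is actually in use.
            pool = min(unclaimed, total_excess)
            for n in children:
                eff[n] += pool * excess[n] / total_excess
    return eff


# B has 10 (GB) of protection; D sets low=10 but uses 6, E sets low=0 and uses 8.
b_children = {"D": (10, 6), "E": (0, 8)}
without_patch = child_protection(10, b_children, recursive=False)
with_patch = child_protection(10, b_children, recursive=True)
```

In this model E is left with no protection in the non-recursive case,
while in the recursive case it picks up the 4G of B's protection that D
does not consume.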

> > 1. Can you please make a practical use case for having scape goats or
> > donor groups to justify retaining what I consider to be an
> > unimportant artifact in the memory.low semantics?
> A.low=10G
> `- B.low=X u=6G
> `- C.low=X u=4G
> `- D.low=0G u=5G
>
> B,C run the workload which should be protected
> D runs job that doesn't need any protection
> u denotes usage
> (I made the example with more than one important sibling to illustrate
> usefulness of some implicit distribution X.)
>
> When outer reclaim comes, reclaiming from B,C would be detrimental to
> their performance, while impact on D is unimportant. (And induced IO
> load on the rest (out of A) too.)

Okay, but this is a different use case than the one we were talking about.

My objection is to opting out of protection against cousins (thus
overriding parental resource assignment), not against siblings.

Expressing priorities between siblings like this is fine. And I
absolutely see practical value in your specific example.

> It's not possible to move D to the A's level, since only A is all what a
> given user can control.

Correct, but you can change the tree to this:

A.low=10G
`- A1.low=10G
   `- B.low=0G
   `- C.low=0G
`- D.low=0G

to express

A1 > D
B = C

That priority order can be matched by CPU and IO controls as well:

A.weight=100
`- A1.weight=100
   `- B.weight=100
   `- C.weight=100
`- D.weight=1

My objection is purely about opting out of resources relative to (and
assuming a lower memory priority than) an outside cousin that may have
a lower priority on other resources.

That is, I would like to see an argument for this setup:

A
`- B io.weight=200 memory.low=10G
   `- D io.weight=100 (e.g.) memory.low=10G
   `- E io.weight=100 (e.g.) memory.low=0
`- C io.weight=50 memory.low=5G

where E has no memory protection against C, but E has IO priority over
C. That's the configuration that cannot be expressed with a recursive
memory.low. But since it involves a priority inversion, it's not useful
for actually isolating and containerizing workloads.

That's why I'm saying it's an artifact, not an actual feature.

> > 2. If you think opting out of hierarchically assigned resources is a
> > fundamentally important usecase, can you please either make an
> > argument why it should also apply to CPU and IO, or alternatively
> > explain in detail why they are meaningfully different?
> I'd say that protected memory is a disposable resource in contrast with
> CPU/IO. If you don't have latter, you don't progress; if you lack the
> former, you are refaulting but can make progress. Even more, you should
> be able to give up memory.min.

Eh, I'm not buying that. You cannot run without memory either. If
somebody reclaims a page between you faulting it in and you resuming
to userspace, there is no forward progress.