Re: [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead

From: Chris Li
Date: Tue Nov 21 2023 - 10:25:07 EST


On Tue, Nov 21, 2023 at 12:33 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > In a strange way it makes sense to charge to C.
> > Swap out == free up memory.
> > Swap in == consume memory.
> > C turn off swap, effectively this behavior will consume a lot of memory.
> > C gets charged, so if the C is out of memory, it will punish C.
> > C will not be able to continue swap in memory. The problem gets under control.
>
> Yes, I think charging either C or B makes sense in their own way. To
> me I think current behavior is kind of counter-intuitive.
>
> Image if there are cgroup PC1, and its child cgroup CC1, CC2. If a process
> swapped out some memory in CC1 then moved to CC2, and CC1 is dying.
> On swapoff the charge will be moved out of PC1...

Yes. In the spirit of punishing the one that is actively doing bad
deeds, the swapoff charge move out of PC1 makes sense, because the
memory effectively is not used in PC1 any more.

The question is why do you want to move the process from CC1 to CC2?
Move process between CGroup is not something supported very well in
the current CGroup model. The memory already charged to the current
CGroup will not move with the process.

> And swapoff often happens in some unlimited admin cgroup or some
> cgroup for management agents.
>
> If PC1 has a memory limit, the process in it can breach the limit easily,
> we will see a process that never left PC1 having a much higher RSS
> than PC1/CC1/CC2's limit.

Hmm, how do you set the limit on PC1? Do you set a hard limit
"memory.max"? What does the "PC1/memory.stat" show?
If you have a script to reproduce this behavior, I would love to learn more.

> And if there is a limit for the management agent cgroup, the agent
> will be OOM instead of OOM in PC1.

It seems what you want to do is have a way for the admin agent to turn
off the swapping on PC1 and let PC1 OOM, right?
In other words, you want the admin agent to turn off swapping in PC1's context.
How about start a bash script, add itself to PC1/cgroup.procs, then
that bash script calls swap off for that PC1.
Will that solve your problem?

I am still not sure why you want to turn off swap for PC1? If you know
PC1 is going to OOM, why not just kill it? A little bit more of the
context of the problem will help me understand your usage case.

> Simply moving a process between the child cgroup of the same parent
> cgroup won't cause a similar issue, things get weird when swapoff is
> involved.

Again, having a reproducing script is very helpful for understanding
what is going on. It can serve as a regression testing tool.

> And actually with multiple layers of swap, it's less risky to swapoff
> a device since other swap devices can catch over committed memory.

Sure. Do you have any feedback if we have "memory.swap.tiers", will
that address your need to control swap usage for individual cgroup?
>
> Oh, and there is one more case I forgot to cover in this series:
> Moving a process is indeed something not happening very frequently,
> but a process run in cgroup then exit, and leave some shmem swapped
> out could be a common case.

That is the zombie memcg problem with shared memory. The current
cgroup does work well with shared memory objects. It is another can of
worms.

> Current behavior on swapoff will move these charges out of the
> original parent cgroup too.
>
> So maybe a more ideal solution for swapoff is: simply always charge a
> dying cgroup parent cgroup?

That is the recharging idea. In this year's LSFMM there is a
presentation about zombie memcgs. Recharging/reparenting is one of the
proposals. If I recall correctly, someone from SUSE made some comment
that they tried it in their kernel at one point. But it had a lot of
issues and the feature was removed from the kernel eventually.

I also made a proposal for a shared memory cgroup model as well. I can
dig it up if you are interested.

> Maybe a sysctl/cmdline could be introduced to control the behavior.

I am against this kind of one off behavior change without a consistent
model of charging the memory. Even behind a sysctl, it is some code we
need to maintain and test function properly.
If you want to change the memory charging model, I would like to see
what is the high level principle it follows so we can reason it in
other usage cases consistently. The least thing I want to see is to
make up rules as we go, then we might end up with contradicting rules
and back ourselves to a corner in the future.
I do care about your usage case, I just want to address it consistently.
My understanding of the current memory cgroup model is charge to the
first use. Punish the trouble maker.

BTW, I am still reviewing your 24 patches series, half way through. It
will take me a bit of time to write up the reply.

Chris