Re: [PATCH 23/24] swap: fix multiple swap leak when after cgroup migrate

From: Kairui Song
Date: Mon Nov 20 2023 - 06:19:49 EST


Huang, Ying <ying.huang@xxxxxxxxx> wrote on Mon, Nov 20, 2023 at 15:37:
>
> Kairui Song <ryncsn@xxxxxxxxx> writes:
>
> > From: Kairui Song <kasong@xxxxxxxxxxx>
> >
> > When a process which previously swapped some memory was moved to
> > another cgroup, and the cgroup it was previously in is dead, then
> > swapped-in pages will be leaked into rootcg. Previous commits fixed the
> > bug for the no-readahead path; this commit fixes the same issue for the
> > readahead path.
> >
> > This can be easily reproduced by:
> > - Set up an SSD or HDD swap.
> > - Create memory cgroups A, B and C.
> > - Spawn process P1 in cgroup A and make it swap out some pages.
> > - Move process P1 to memory cgroup B.
> > - Destroy cgroup A.
> > - Do a swapoff in cgroup C.
> > - Swapped-in pages are accounted to cgroup C.
> >
> > This patch fixes it by making the swapped-in pages accounted to cgroup B.
>
> According to the "Memory Ownership" section of
> Documentation/admin-guide/cgroup-v2.rst,
>
> "
> A memory area is charged to the cgroup which instantiated it and stays
> charged to the cgroup until the area is released. Migrating a process
> to a different cgroup doesn't move the memory usages that it
> instantiated while in the previous cgroup to the new cgroup.
> "
>
> Because we don't move the charge when we move a task from one cgroup to
> another, it's controversial which cgroup should be charged.
> According to the above document, it's acceptable to charge cgroup C
> (the cgroup where swapoff happens).

Hi Ying, thank you very much for the info!

It is controversial indeed; it's just that the original behavior is
kind of counter-intuitive.

Imagine there is a cgroup P1 with child cgroups C1 and C2. If a process
swapped out some memory in C1, then moved to C2, and C1 is dead, then
on swapoff the charge will be moved out of P1...

And swapoff often happens in some unlimited cgroup or in a cgroup for
a management agent.

If P1 has a memory limit, it can breach the limit easily: we will see
a process that never left P1 having a much higher RSS than the limit of
P1/C1/C2.
And if there is a limit for the management agent cgroup, the agent
will hit OOM instead of P1.
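
To make the scenario concrete, here is a toy Python model (not kernel
code). The Cgroup class, the page counts, and the limits are all made up
for illustration, and limits are only checked, never enforced. The only
behavior taken from this thread is the choice of which cgroup swapoff
charges the swapped-in pages to.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    """Hypothetical stand-in for a memory cgroup; counts pages only."""
    name: str
    parent: Optional["Cgroup"] = None
    limit: Optional[int] = None  # in pages; None means unlimited
    usage: int = 0

    def charge(self, pages: int) -> None:
        # Charges are hierarchical: every ancestor sees them too.
        node = self
        while node is not None:
            node.usage += pages
            node = node.parent

    def over_limit(self) -> bool:
        # Only reports the breach; a real kernel would reclaim or OOM.
        return self.limit is not None and self.usage > self.limit


def simulate(charge_to_current_cgroup: bool) -> dict:
    """Process swapped out in C1 (now dead), lives in C2, agent runs swapoff."""
    root = Cgroup("root")
    p1 = Cgroup("P1", parent=root, limit=100)
    c2 = Cgroup("C2", parent=p1)
    agent = Cgroup("agent", parent=root, limit=64)

    resident, swapped = 50, 80  # illustrative page counts
    c2.charge(resident)

    # swapoff brings the swapped pages back in; pick who pays for them.
    target = c2 if charge_to_current_cgroup else agent
    target.charge(swapped)

    return {
        "process_rss": resident + swapped,
        "P1_usage": p1.usage,
        "P1_over_limit": p1.over_limit(),
        "agent_usage": agent.usage,
        "agent_over_limit": agent.over_limit(),
    }


if __name__ == "__main__":
    print("charge swapoff caller :", simulate(False))
    print("charge current cgroup :", simulate(True))
```

With the swapoff caller charged, the process ends up with more resident
pages than P1's limit while P1's own usage stays under it, and the agent
cgroup is the one pushed over its limit. With the process's current
cgroup charged, the pressure lands in P1.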

Simply moving a process between child cgroups of the same parent
cgroup won't cause such an issue; things get weird when swapoff is
involved.

Or maybe we should try to be compatible, and introduce a sysctl or
cmdline option for this?