Re: [PATCH v4 0/4] Deterministic charging of shared memory

From: Roman Gushchin
Date: Mon Nov 22 2021 - 18:10:17 EST


On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote:
> On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > Problem:
> > Currently shared memory is charged to the memcg of the allocating
> > process. This makes memory usage of processes accessing shared memory
> > a bit unpredictable since whichever process accesses the memory first
> > will get charged. We have a number of use cases where our userspace
> > would like deterministic charging of shared memory:
> >
> > 1. System services allocating memory for client jobs:
> > We have services (namely a network access service[1]) that provide
> > functionality for clients running on the machine and allocate memory
> > to carry out these services. The memory usage of these services
> > depends on the number of jobs running on the machine and the nature of
> > the requests made to the service, which makes the memory usage of
> > these services hard to predict and thus hard to limit via memory.max.
> > These system services would like a way to allocate memory and instruct
> > the kernel to charge this memory to the client’s memcg.
> >
> > 2. Shared filesystem between subtasks of a large job
> > Our infrastructure has large meta jobs such as kubernetes which spawn
> > multiple subtasks which share a tmpfs mount. These jobs and their
> > subtasks use that tmpfs mount for various purposes such as data
> > sharing or persistent data between the subtask restarts. In kubernetes
> > terminology, the meta job is similar to pods and subtasks are
> > containers under pods. We want the shared memory to be
> > deterministically charged to the Kubernetes pod and independent of
> > the lifetime of containers under the pod.
> >
> > 3. Shared libraries and language runtimes shared between independent jobs.
> > We’d like to optimize memory usage on the machine by sharing libraries
> > and language runtimes of many of the processes running on our machines
> > in separate memcgs. This has the side effect that one job may be
> > unlucky enough to be the first to access many of the libraries, and
> > may get oom-killed as all the cached files get charged to it.
> >
> > Design:
> > My rough proposal to solve this problem is simply to add a
> > ‘memcg=/path/to/memcg’ mount option for filesystems, directing
> > all the memory of the filesystem to be ‘remote charged’ to the
> > cgroup provided by that memcg= option.
> >
> > Caveats:
> >
> > 1. One complication to address is the behavior when the target memcg
> > hits its memory.max limit because of remote charging. In this case the
> > oom-killer will be invoked, but the oom-killer may not find anything
> > to kill in the target memcg being charged. There are a number of
> > considerations in this case:
> >
> > 1. It's not great to kill the allocating process, since it is not
> > running in the memcg under oom, and killing it will not free memory
> > in that memcg.
> > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> > somehow. If not, the process will loop on the pagefault forever in the
> > upstream kernel.
> >
> > In this case, I propose simply failing the remote charge and returning ENOSPC
> > to the caller. This will cause the process executing the remote charge to get
> > an ENOSPC on non-pagefault paths, and a SIGBUS on the pagefault path. This
> > will be documented behavior of remote charging, and this feature is
> > opt-in. Users can:
> > - Not opt into the feature if they want.
> > - Opt into the feature and accept the risk of receiving ENOSPC or SIGBUS,
> > and abort if they desire.
> > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> > operation without executing the remote charge if possible.
> >
> > 2. Only processes allowed to enter the cgroup at mount time can mount a
> > tmpfs with memcg=<cgroup>. This is to prevent intentional DoS of random
> > cgroups on the machine. However, once a filesystem is mounted with
> > memcg=<cgroup>, any process with write access to this mount point will be
> > able to charge memory to <cgroup>. This is largely a non-issue because in
> > configurations where there is untrusted code running on the machine, mount
> > point access needs to be restricted to the intended users only, regardless
> > of whether the mount point memory is deterministically charged or not.
>
> I'm not a fan of this. It uses filesystem mounts to create shareable
> resource domains outside of the cgroup hierarchy, which has all the
> downsides you listed, and more:
>
> 1. You need a filesystem interface in the first place, and a new
> ad-hoc channel and permission model to coordinate with the cgroup
> tree, which isn't great. All filesystems you want to share data on
> need to be converted.
>
> 2. It doesn't extend to non-filesystem sources of shared data, such as
> memfds, ipc shm etc.
>
> 3. It requires unintuitive configuration for what should be basic
> shared accounting semantics. By default you still get the old
> 'first touch' semantics, but to get sharing you need to reconfigure
> the filesystems?
>
> 4. If a task needs to work with a hierarchy of data sharing domains -
> system-wide, group of jobs, job - it must interact with a hierarchy
> of filesystem mounts. This is a pain to setup and may require task
> awareness. Moving data around, working with different mount points.
> Also, no shared and private data accounting within the same file.
>
> 5. It reintroduces cgroup1 semantics of tasks and resources, which are
> entangled, sitting in disjoint domains. OOM killing is one quirk of
> that, but there are others you haven't touched on. Who is charged
> for the CPU cycles of reclaim in the out-of-band domain? Who is
> charged for the paging IO? How is resource pressure accounted and
> attributed? Soon you need cpu= and io= as well.
>
> My take on this is that it might work for your rather specific
> usecase, but it doesn't strike me as a general-purpose feature
> suitable for upstream.
>
>
> If we want sharing semantics for memory, I think we need a more
> generic implementation with a cleaner interface.
>
> Here is one idea:
>
> Have you considered reparenting pages that are accessed by multiple
> cgroups to the first common ancestor of those groups?
>
> Essentially, whenever there is a memory access (minor fault, buffered
> IO) to a page that doesn't belong to the accessing task's cgroup, you
> find the common ancestor between that task and the owning cgroup, and
> move the page there.
>
> With a tree like this:
>
> root - job group - job
>      |           `- job
>      `- job group - job
>                  `- job
>
> all pages accessed inside that tree will propagate to the highest
> level at which they are shared - which is the same level where you'd
> also set shared policies, like a job group memory limit or io weight.
>
> E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> pages would bubble to the respective job group, private data would
> stay within each job.
>
> No further user configuration necessary. Although you still *can* use
> mount namespacing etc. to prohibit undesired sharing between cgroups.
>
> The actual user-visible accounting change would be quite small, and
> arguably much more intuitive. Remember that accounting is recursive,
> meaning that a job page today also shows up in the counters of job
> group and root. This would not change. The only thing that IS weird
> today is that when two jobs share a page, it will arbitrarily show up
> in one job's counter but not in the other's. That would change: it
> would no longer show up as either, since it's not private to either;
> it would just be a job group (and up) page.

In general I like the idea, but I think the user-visible change will be quite
large, almost "cgroup v3"-large. Here are some problems:
1) Anything shared between e.g. system.slice and user.slice now belongs
to the root cgroup and is completely unaccounted/unlimited. E.g. all pagecache
belonging to shared libraries.
2) It's concerning in security terms. If I understand the idea correctly,
read-only access will allow moving charges to an upper level, potentially
crossing memory.max limits. It doesn't sound safe.
3) It brings a non-trivial amount of memory to non-leaf cgroups. To some extent
it returns us to the cgroup v1 world and the question of competition between
resources consumed by a cgroup directly and through children cgroups. It's not
that the problem doesn't exist now, but it's less pronounced.
If, say, >50% of system.slice's memory belongs to system.slice directly,
then we will likely need separate non-recursive counters, limits, protections,
etc.
4) Imagine a production server and a system administrator logging in via ssh
(and being put into user.slice) and running a big grep... It screws up all
memory accounting until the next reboot. Not a completely impossible scenario.

That said, I agree with Johannes and I'm also not a big fan of this patchset.

I agree that the problem exists and that the patchset provides a solution, but
it doesn't look nice (or generic enough) and creates a lot of questions and
corner cases.

Btw, won't (an optional) disabling of memcg accounting for a tmpfs solve your
problem? It would be less invasive and would not require any oom changes.