Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

From: David Rientjes
Date: Thu Nov 15 2012 - 03:11:39 EST


On Wed, 14 Nov 2012, Anton Vorontsov wrote:

> Thanks again for your inspirational comments!
>

Heh, not sure I've been too inspirational (probably more annoying than
anything else). I really do want generic memory pressure notifications in
the kernel and already have some ideas on how I can tie it into our malloc
arenas, so please do keep working on it.

> I think I understand what you're saying, and surely it makes sense, but I
> don't know how you see this implemented on the API level.
>
> Getting struct {pid, pressure} pairs that cause the pressure at the
> moment? And the monitor only gets <pids> that are in the same cpuset? How
> about memcg limits?..
>

Depends on whether you want to support mempolicies or not and the argument
could go either way:

- FOR supporting mempolicies: memory that you mbind() to can become
depleted, and since there is no fallback you have no way to prevent
lots of reclaim and/or invoking the oom killer. It would be
disappointing to not be able to get notifications of such a condition.

- AGAINST supporting mempolicies: you only need to support memory
isolation for cgroups (memcg and cpusets) and thus can implement your
own memory pressure cgroup that you can use to aggregate tasks and
then replace memcg memory thresholds with co-mounting this new cgroup
that would notify on an eventfd anytime one of the attached processes
experiences memory pressure.
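
For reference, a rough sketch of how the existing memcg threshold
notification is registered from userspace through cgroup v1's
cgroup.event_control file; a new pressure cgroup could expose something
similar. The cgroup path passed in and both function names are
hypothetical, and wait_for_threshold() assumes Linux with a mounted
cgroup v1 memory controller and Python 3.10+ (for os.eventfd).

```python
import os


def format_threshold_registration(event_fd, target_fd, threshold_bytes):
    """Build the line that cgroup v1 expects in cgroup.event_control:
    "<event_fd> <fd of memory.usage_in_bytes> <threshold>"."""
    return f"{event_fd} {target_fd} {threshold_bytes}"


def wait_for_threshold(cgroup_dir, threshold_bytes):
    """Block until memory usage in cgroup_dir crosses threshold_bytes.

    cgroup_dir is a hypothetical path such as
    /sys/fs/cgroup/memory/mygroup.
    """
    event_fd = os.eventfd(0)
    target_fd = os.open(
        os.path.join(cgroup_dir, "memory.usage_in_bytes"), os.O_RDONLY
    )
    try:
        control = os.path.join(cgroup_dir, "cgroup.event_control")
        with open(control, "w") as f:
            f.write(
                format_threshold_registration(
                    event_fd, target_fd, threshold_bytes
                )
            )
        # Blocks until the kernel signals the eventfd, then returns
        # the eventfd counter value.
        return os.eventfd_read(event_fd)
    finally:
        os.close(target_fd)
        os.close(event_fd)
```

A monitor would call wait_for_threshold() in a loop (or poll the eventfd
alongside other descriptors) and react each time it returns.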

> > Most processes aren't going to care if they are running into memory
> > pressure and have no implementation to free memory back to the kernel or
> > start ratelimiting themselves. They will just continue happily along
> > until they get the memory they want or they get oom killed. The ones that
> > do, however, or a job scheduler or monitor that is watching over the
> > memory usage of a set of tasks, will be able to do something when
> > notified.
>
> Yup, this is exactly how we want to use this. In Android we have "Activity
> Manager" thing, which acts exactly how you describe: it's a tasks monitor.
>

In addition to that, I think I can hook into our implementation of malloc,
which frees memory back to the kernel with MADV_DONTNEED and zaps
individual ptes to poke holes in the memory it allocates. Under normal
circumstances it would actually cache the memory that we free() and then
re-use it to return cache-hot memory on the next allocation. Under memory
pressure, as triggered by your interface (but for threads attached to a
memcg facing memcg limits), it would drain the memory back to the kernel
immediately.
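
A toy model of that allocator behaviour, only to make the idea concrete.
This is not the actual malloc implementation: the class and method names
are made up, bytearray stands in for real memory, and the "returned"
counter stands in for what MADV_DONTNEED would give back to the kernel.

```python
class CachingArena:
    """Toy model of a malloc arena that caches freed blocks."""

    def __init__(self):
        self.cache = {}      # block size -> list of freed blocks
        self.returned = 0    # bytes handed back to the "kernel"

    def malloc(self, size):
        blocks = self.cache.get(size)
        if blocks:
            return blocks.pop()   # reuse a cache-hot block
        return bytearray(size)    # stand-in for fresh memory

    def free(self, block):
        # Normal circumstances: keep the block for the next malloc().
        self.cache.setdefault(len(block), []).append(block)

    def on_pressure(self):
        """Pressure notification: drain the whole cache immediately."""
        for blocks in self.cache.values():
            self.returned += sum(len(b) for b in blocks)
        self.cache.clear()
```

on_pressure() is where a notification from the pressure interface would
be wired in; everything else is the normal fast path.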

> > In the hopes of a single API that can do all this and not a
> > reimplementation for various types of memory limitations (it seems like
> > what you're suggesting is at least three different APIs: system-wide via
> > vmpressure_fd, memcg via memcg thresholds, and cpusets through an eventual
> > cpuset threshold), I'm hoping that we can have a single interface that can
> > be polled on to determine when individual processes are encountering
> > memory pressure. And if I'm not running in your oom cpuset, I don't care
> > about your memory pressure.
>
> I'm not sure what exactly you are opposing. :) You don't want to have
> three "kinds" of pressure, or you don't want to have three different
> interfaces to each of them, or both?
>

The three pressures are a separate topic (I think it would be better to
have some measure of memory pressure similar to your reclaim scale and
allow users to get notifications at levels they define). I really dislike
having multiple interfaces that are all different from one another
depending on the context.

Given what we have right now with memory thresholds in memcg, if we were
to merge vmpressure_fd, then we're significantly limiting the usecase
since applications need not know if they are attached to a memcg or not:
it's a type of virtualization that one admin may set up, while another
admin may be running unconstrained on a system with much more memory. So for
your usecase of a job monitor, that would work fine for global oom
conditions but the application no longer has an API to use if it wants to
know when it itself is feeling memory pressure.

I think others have voiced their opinion on trying to create a single API
for memory pressure notifications as well; it's just a hard problem and
takes a lot of work to determine how we can make it easy to use, easy to
understand, and extensible at the same time.

> > I don't understand, how would this work with cpusets, for example, with
> > vmpressure_fd as defined? The cpuset policy is embedded in the page
> > allocator and skips over zones that are not allowed when trying to find a
> > page of the specified order. Imagine a cpuset bound to a single node that
> > is under severe memory pressure. The reclaim logic will get triggered and
> > cause a notification on your fd when the rest of the system's nodes may
> > have tons of memory available.
>
> Yes, I see your point: we have many ways to limit resources, so it makes
> it hard to identify the cause of the "pressure" and thus how to deal with
> it, since the pressure might be caused by different kinds of limits, and
> freeing memory from one bucket doesn't mean that the memory will be
> available to the process that is requesting the memory.
>
> So we do want to know whether a specific cpuset is under pressure, whether
> a specific memcg is under pressure, or whether the system (and kernel
> itself) lacks memory.
>
> And we want to have a single API for this? Heh. :)
>

Might not be too difficult if you implement your own cgroup to aggregate
these tasks for which you want to know memory pressure events. It would
have to be triggered for the task trying to allocate memory at any given
time, based on how hard it was to allocate that memory in the slowpath,
tie it back to that task's memory pressure cgroup, and then report the
trigger if it's over a user-defined threshold normalized to the 0-100
scale. Then you could co-mount this cgroup with memcg, cpusets, or just do
it for the root cgroup for users who want to monitor the entire system
(CONFIG_CGROUPS is enabled by default).
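
A minimal userspace model of that flow, to show the shape of it. No such
cgroup exists; PressureGroup, slowpath_pressure() and the
scanned/reclaimed metric are all assumptions made for illustration, and
the events list stands in for signalling an eventfd.

```python
class PressureGroup:
    """Toy model of a hypothetical memory pressure cgroup."""

    def __init__(self, threshold):
        self.threshold = threshold   # user-defined, on the 0-100 scale
        self.pids = set()            # tasks attached to this group
        self.events = []             # stands in for eventfd signals

    def attach(self, pid):
        self.pids.add(pid)

    def report(self, pid, pressure):
        # Called for the allocating task; only attached tasks at or
        # over the threshold produce a notification.
        if pid in self.pids and pressure >= self.threshold:
            self.events.append((pid, pressure))


def slowpath_pressure(scanned, reclaimed):
    """Assumed metric: share of scanned pages that could not be
    reclaimed, normalized to 0-100."""
    if scanned == 0:
        return 0
    return 100 * (scanned - reclaimed) // scanned
```

Monitoring the entire system would then amount to attaching every task
to one such group at the root.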