Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work

From: Michal Hocko
Date: Wed Jan 16 2019 - 02:02:27 EST


On Tue 15-01-19 11:38:23, Shakeel Butt wrote:
> On Mon, Jan 14, 2019 at 11:25 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > On Mon 14-01-19 12:18:07, Shakeel Butt wrote:
> > > On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> > > > > Hi Johannes,
> > > > >
> > > > > On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > > > >
> > > > > > Hi Shakeel,
> > > > > >
> > > > > > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > > > > > If a memcg is over high limit, memory reclaim is scheduled to run on
> > > > > > > return-to-userland. However it is assumed that the memcg is the current
> > > > > > > process's memcg. With remote memcg charging for kmem or swapping in a
> > > > > > > page charged to remote memcg, current process can trigger reclaim on
> > > > > > > > remote memcg. So, scheduling reclaim on return-to-userland for remote
> > > > > > > memcgs will ignore the high reclaim altogether. So, record the memcg
> > > > > > > needing high reclaim and trigger high reclaim for that memcg on
> > > > > > > return-to-userland. However if the memcg is already recorded for high
> > > > > > > > reclaim and the recorded memcg is not the descendant of the memcg
> > > > > > > needing high reclaim, punt the high reclaim to the work queue.
> > > > > >
> > > > > > The idea behind remote charging is that the thread allocating the
> > > > > > memory is not responsible for that memory, but a different cgroup
> > > > > > is. Why would the same thread then have to work off any high excess
> > > > > > this could produce in that unrelated group?
> > > > > >
> > > > > > > Say you have an inotify/dnotify listener that is restricted in its
> > > > > > memory use - now everybody sending notification events from outside
> > > > > > that listener's group would get throttled on a cgroup over which it
> > > > > > has no control. That sounds like a recipe for priority inversions.
> > > > > >
> > > > > > It seems to me we should only do reclaim-on-return when current is in
> > > > > > the ill-behaved cgroup, and punt everything else - interrupts and
> > > > > > remote charges - to the workqueue.
> > > > >
> > > > > This is what v1 of this patch was doing but Michal suggested to do
> > > > > what this version is doing. Michal's argument was that the current is
> > > > > already charging and maybe reclaiming a remote memcg then why not do
> > > > > the high excess reclaim as well.
> > > >
> > > > Johannes has a good point about the priority inversion problems which I
> > > > haven't thought about.
> > > >
> > > > > Personally I don't have any strong opinion either way. What I actually
> > > > > wanted was to punt this high reclaim to some process in that remote
> > > > > memcg. However I didn't explore that direction much, wondering whether
> > > > > that complexity is worth it. Maybe I should at least explore it so
> > > > > we can compare the solutions. What do you think?
> > > >
> > > > My question would be whether we really care all that much. Do we know of
> > > > workloads which would generate a large high limit excess?
> > > >
> > >
> > > The current semantics of memory.high is that it can be breached under
> > > extreme conditions. However, for any workload where memory.high is used and
> > > a lot of remote memcg charging happens (inotify/dnotify example given
> > > by Johannes or swapping in tmpfs file or shared memory region) the
> > > memory.high breach will become common.
> >
> > This is exactly what I am asking about. Is this something that can
> > happen easily? Remote charges by themselves should be rare, no?
> >
>
> At the moment, for kmem we can do remote charging for fanotify,
> inotify and buffer_head and for anon pages we can do remote charging
> on swap in. Now based on the workload's cgroup setup the remote
> charging can be very frequent or rare.
>
> At Google, remote charging is very frequent but since we are still on
> cgroup-v1 and do not use memory.high, the issue this patch is fixing
> is not observed. However for the adoption of cgroup-v2, this fix is
> needed.

Adding some numbers into the changelog would be really valuable to judge
the urgency and the scale of the problem. If we are going via kworker
then it is also important to evaluate what kind of effect on the system
this has. How large can the excess get? Why don't those memcgs
resolve the excess by themselves on the first direct charge? Is it
possible that kworkers simply swamp the system with many parallel memcgs
with remote charges?

In other words we need deeper analysis of the problem and the solution.
--
Michal Hocko
SUSE Labs