Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

From: Johannes Weiner
Date: Thu Oct 29 2015 - 13:52:50 EST


On Thu, Oct 29, 2015 at 12:27:47PM +0300, Vladimir Davydov wrote:
> On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote:
> > Having the hard limit as a failsafe (or a minimum for other consumers)
> > is one thing, and certainly something I'm open to for cgroupv2, should
> > we have problems with loads starting up after a socket memory landgrab.
> >
> > That being said, if the VM is struggling to reclaim pages, or is even
> > swapping, it makes perfect sense to let the socket memory scheduler
> > know it shouldn't continue to increase its footprint until the VM
> > recovers. Regardless of any hard limitations/minimum guarantees.
> >
> > This is what my patch does and it seems pretty straight-forward to
> > me. I don't really understand why this is so controversial.
>
> I'm not arguing that the idea behind this patch set is necessarily bad.
> Quite the contrary, it does look interesting to me. I'm just saying that
> IMO it can't replace hard/soft limits. It probably could if it was
> possible to shrink buffers, but I don't think it's feasible, even
> theoretically. That's why I propose not to change the behavior of the
> existing per memcg tcp limit at all. And frankly I don't get why you are
> so keen on simplifying it. You say it's a "crapload of boilerplate
> code". Well, I don't see how it is - it just replicates global knobs and
> I don't see how it could be done in a better way. The code is hidden
> behind jump labels, so the overhead is zero if it isn't used. If you
> really dislike this code, we can isolate it under a separate config
> option. But all right, I don't rule out the possibility that the code
> could be simplified. If you do that w/o breaking it, that'll be OK to
> me, but I don't see why it should be related to this particular patch
> set.

Okay, I see your concern.

I'm not trying to change the behavior, just the implementation,
because it's too complex for the functionality it actually provides.

And the reason it's part of this patch set is that I'm using the same
code to hook into the memory accounting, so it makes sense to refactor
this stuff in the same pass.

There is also a niceness factor of not adding more memcg callbacks to
the networking subsystem when there is an option to consolidate them.

Now, you mentioned that you'd rather see the socket buffers accounted
at the allocator level, but I looked at the different allocation paths
and network protocols and I'm not convinced that this makes sense. We
don't want to be in the hotpath of every single packet when a lot of
them are small, short-lived management blips that never reach
userspace and are created and disposed of entirely inside the kernel.

__sk_mem_schedule() on the other hand is already wired up to exactly
those consumers we are interested in for memory isolation: those with
bigger chunks of data attached to them and those that have exploding
receive queues when userspace fails to read(). UDP and TCP.

I mean, there is a reason why the global memory limits apply to only
those types of packets in the first place: everything else is noise.
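
To make it concrete, this is roughly the shape of the hook-up I have
in mind (a sketch only; mem_cgroup_charge_skmem() and sk->sk_memcg are
placeholder names for whatever the final interface ends up being):

	int __sk_mem_schedule(struct sock *sk, int size, int kind)
	{
		int amt = sk_mem_pages(size);

		sk->sk_forward_alloc += amt << SK_MEM_QUANTUM_SHIFT;
		sk_memory_allocated_add(sk, amt);

		/*
		 * Charge the owning memcg right next to the protocol's
		 * global accounting: only TCP/UDP buffer memory is
		 * counted, and only in page-sized chunks, not per packet.
		 */
		if (sk->sk_memcg &&
		    !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
			goto suppress_allocation;

		/* global protocol limit and pressure checks go here */
		return 1;

	suppress_allocation:
		sk_memory_allocated_sub(sk, amt);
		sk->sk_forward_alloc -= amt << SK_MEM_QUANTUM_SHIFT;
		return 0;
	}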

I agree that it's appealing to account at the allocator level and set
page->mem_cgroup etc., but then we'd pay extra to capture a lot of
noise, and I don't want to pay that price just for aesthetics. It's
better to track ownership at the socket level and only count the
packets that can actually accumulate a significant amount of memory.
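
The ownership side of that is cheap, too: the socket simply pins the
memcg of the task that creates (or accepts) it. Along these lines
(again just a sketch, with sk_memcg standing in for the actual field):

	void sock_update_memcg(struct sock *sk)
	{
		struct mem_cgroup *memcg;

		rcu_read_lock();
		memcg = mem_cgroup_from_task(current);
		/* pin the memcg for the lifetime of the socket */
		if (memcg != root_mem_cgroup &&
		    css_tryget_online(&memcg->css))
			sk->sk_memcg = memcg;
		rcu_read_unlock();
	}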

> > We tried using the per-memcg tcp limits, and that prevents the OOMs
> > for sure, but it's horrendous for network performance. There is no
> > "stop growing" phase, it just keeps going full throttle until it hits
> > the wall hard.
> >
> > Now, we could probably try to replicate the global knobs and add a
> > per-memcg soft limit. But you know better than anyone else how hard it
> > is to estimate the overall workingset size of a workload, and the
> > margins on containerized loads are razor-thin. Performance is much
> > more sensitive to input errors, and often times parameters must be
> > adjusted continuously during the runtime of a workload. It'd be
> > disastrous to rely on yet more static, error-prone user input here.
>
> Yeah, but the dynamic approach proposed in your patch set doesn't
> guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> reduces this possibility. Of course, a memcg OOM is not nearly as
> disastrous as a global one, but it still usually means workload breakage.

Right now, the entire machine breaks. Confining the breakage to the
faulty memcg, and reducing the likelihood of that OOM in many cases,
seems like a step in the right direction, no?

And how likely are memcg OOMs because of this anyway? There is of
course an imaginable scenario where the packets pile up and then some
*other* part of the workload, one that doesn't read() and process
packets, tries to expand, fails, and goes OOM. But that seems like a
complete corner case. In the vast majority
of cases, the application will be in full operation and just fail to
read() fast enough--because the network bandwidth is enormous compared
to the container's size, or because it shares the CPU with thousands
of other workloads and there is scheduling latency.

This would be the perfect point to rein in the transmit window...

> The static approach is error-prone for sure, but it has existed for
> years and worked satisfactorily AFAIK.

...but that point is not a fixed amount of memory consumed. It depends
on the workload and the random interactions it's having with thousands
of other containers on that same machine.
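
What the reclaim-driven pressure signal buys us is that the existing
window moderation can key off it directly, with no per-cgroup tunable
involved. Something like this, with mem_cgroup_under_socket_pressure()
standing in for however the pressure state ends up being exposed:

	static bool tcp_under_memory_pressure(const struct sock *sk)
	{
		/* per-memcg pressure: raised by reclaim, cleared as it eases */
		if (sk->sk_memcg &&
		    mem_cgroup_under_socket_pressure(sk->sk_memcg))
			return true;

		return tcp_memory_pressure;
	}

tcp_should_expand_sndbuf() and the receive window clamping already
consult this check, so the windows stop growing exactly when the
cgroup starts struggling to reclaim, and resume once it recovers.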

The point of containers is to maximize utilization of your hardware
and systematically eliminate slack in the system. But it's exactly
that slack on dedicated bare-metal machines that allowed us to take a
wild guess at the settings and then tune them based on observing a
handful of workloads. This approach is not going to work anymore when
we pack the machine to capacity and still expect every single
container out of thousands to perform well. We need that automation.

The static setting working okay on the global level is also why I'm
not interested in starting to experiment with it. There is no reason
to change it. It's much more likely that any attempt to change it will
be shot down, not because of the approach chosen, but because there is
no problem to solve there. I doubt we can get networking people to
care about containers by screwing with things that work for them ;-)