Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

From: Michal Hocko
Date: Tue Sep 01 2015 - 11:01:30 EST


On Tue 01-09-15 16:40:03, Vladimir Davydov wrote:
> On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> > On Mon 31-08-15 17:20:49, Vladimir Davydov wrote:
{...}
> > > 1. SLAB. Suppose someone calls kmalloc_node and there is enough free
> > > memory on the preferred node. W/o memcg limit set, the allocation
> > > will happen from the preferred node, which is OK. If there is memcg
> > > limit, we can currently fail to allocate from the preferred node if
> > > we are near the limit. We issue memcg reclaim and go to fallback
> > > alloc then, which will most probably allocate from a different node,
> > > although there is no reason for that. This is a bug.
> >
> > I am not familiar with the SLAB internals much but how is it different
> > from the global case. If the preferred node is full then __GFP_THISNODE
> > request will make it fail early even without giving GFP_NOWAIT
> > additional access to atomic memory reserves. The fact that memcg case
> > fails earlier is perfectly expected because the restriction is tighter
> > than the global case.
>
> memcg restrictions are orthogonal to NUMA: failing an allocation from a
> particular node does not mean failing memcg charge and vice versa.

Sure, memcg doesn't care about NUMA; it just puts an additional constraint
on top of all the existing ones. The point I've tried to make is that the
page allocator (with the node restriction) and memcg (with the cumulative
amount restriction) currently behave consistently: neither of them tries
to reclaim in order to achieve its goal. How conservative memcg is about
allowing GFP_NOWAIT allocations is a separate issue, and all those details
belong to memcg proper, just as the allocation strategy for these
allocations belongs to the page allocator.
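
To make the "consistent behaviour" point concrete, here is a toy
user-space model, not kernel code. Every name in it (MODEL_GFP_WAIT,
struct model_memcg, model_try_charge) is made up for this sketch and is
not a memcg interface. It only shows a charge path that honours the
caller's context: without the wait bit it fails at the limit without
reclaiming, and with the wait bit it reclaims first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a "caller may block and reclaim" gfp bit. */
#define MODEL_GFP_WAIT 0x1u

/* Toy cgroup state, counted in pages. Not the kernel's struct mem_cgroup. */
struct model_memcg {
	size_t usage;       /* pages currently charged */
	size_t limit;       /* hard limit */
	size_t reclaimable; /* charged pages that reclaim could free */
};

/* Free up to 'want' pages and return how many were actually freed. */
static size_t model_reclaim(struct model_memcg *cg, size_t want)
{
	size_t freed = want < cg->reclaimable ? want : cg->reclaimable;

	/* Model invariant: reclaimable pages are a subset of charged pages. */
	assert(cg->reclaimable <= cg->usage);
	cg->reclaimable -= freed;
	cg->usage -= freed;
	return freed;
}

/*
 * Charge nr_pages against the limit. Reclaim is attempted only if the
 * caller's flags allow waiting; otherwise the charge fails right away.
 */
static bool model_try_charge(struct model_memcg *cg, unsigned int gfp,
			     size_t nr_pages)
{
	if (cg->usage + nr_pages <= cg->limit) {
		cg->usage += nr_pages;
		return true;
	}
	if (!(gfp & MODEL_GFP_WAIT))
		return false; /* no-wait context: fail without reclaiming */

	model_reclaim(cg, cg->usage + nr_pages - cg->limit);
	if (cg->usage + nr_pages > cg->limit)
		return false; /* reclaim could not make enough room */

	cg->usage += nr_pages;
	return true;
}
```

In this model a no-wait charge near the limit fails and leaves the usage
untouched, while the same charge with MODEL_GFP_WAIT succeeds after
reclaim. Whether the slab layer should pass the caller's flags or weaker
ones to the charge is exactly what is being argued in this thread.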

> > How the fallback is implemented and whether trying other node before
> > reclaiming from the preferred one is reasonable I dunno. This is for
> > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> > for memcg enabled setups subtly different. And that is bad.
>
> Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
> inspecting if a NUMA node has free pages makes SLAB behaviour subtly
> different: SLAB will walk over all NUMA nodes for nothing instead of
> invoking memcg reclaim once a free page is found.

So you are saying that the SLAB kmem accounting in this particular path
is suboptimal because the fallback mode doesn't retry the local node with
reclaim enabled before falling back to other nodes?
I would consider that quite surprising as well, even for the global case,
because __GFP_THISNODE doesn't wake up kswapd to make room on that node.

> You are talking about memcg/kmem accounting as if it were done in the
> buddy allocator on top of which the slab layer is built knowing nothing
> about memcg accounting on the lower layer. That's not true and that
> simply can't be true. Kmem accounting is implemented at the slab layer.
> Memcg provides its memcg_charge_slab/uncharge methods solely for
> slab core, so it's OK to have some calling conventions between them.
> What we are really obliged to do is to preserve behavior of slab's
> external API, i.e. kmalloc and friends.

I guess I understand what you are saying here, but it sounds like special
casing which tries to be clever because the current code understands
both the lower level allocator and the kmem charge paths and decides how
to juggle them. This is imho bad and hard to maintain long term.

> > > 2. SLUB. Someone calls kmalloc and there is enough free high order
> > > pages. If there is no memcg limit, we will allocate a high order
> > > slab page, which is in accordance with SLUB internal logic. With
> > > memcg limit set, we are likely to fail to charge high order page
> > > (because we currently try to charge high order pages w/o __GFP_WAIT)
> > > and fallback on a low order page. The latter is unexpected and
> > > unjustified.
> >
> > And this case is very similar and I even argue that it shows more
> > brokenness with your patch. The SLUB allocator has _explicitly_ asked
> > for an allocation _without_ reclaim because that would be unnecessarily
> > too costly and there is other less expensive fallback. But memcg would
>
> You are ignoring the fact that, in contrast to alloc_pages, for memcg
> there is practically no difference between charging a 4-order page or a
> 1-order page.

But this is an implementation detail which might change at any time in
the future.

> OTOH, using 1-order pages where we could go with 4-order
> pages increases page fragmentation at the global level. This subtly
> breaks internal SLUB optimization. Once again, kmem accounting is not
> something staying aside from slab core, it's a part of slab core.

This is certainly true, and it is what you get when you put an additional
constraint on top of an existing one. You simply cannot get both the
great performance _and_ a local memory restriction.
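
The two positions on the SLUB case can be put side by side in a toy
user-space model. This is not the SLUB code; all names (TOY_GFP_WAIT,
struct toy_cgroup, toy_alloc_slab) and the high_order_may_reclaim switch
are invented for this sketch. It assumes only what is stated in this
thread: the high order attempt is opportunistic and there is a cheaper
low order fallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a "caller may block and reclaim" gfp bit. */
#define TOY_GFP_WAIT 0x1u

/* Toy cgroup state, counted in pages. */
struct toy_cgroup {
	size_t usage;       /* pages currently charged */
	size_t limit;       /* hard limit */
	size_t reclaimable; /* charged pages that reclaim could free */
};

/* Charge 'pages'; reclaim the excess first only if gfp allows waiting. */
static bool toy_charge(struct toy_cgroup *cg, unsigned int gfp, size_t pages)
{
	assert(cg->reclaimable <= cg->usage);
	if (cg->usage + pages > cg->limit && (gfp & TOY_GFP_WAIT)) {
		size_t excess = cg->usage + pages - cg->limit;
		size_t freed = excess < cg->reclaimable ?
			       excess : cg->reclaimable;

		cg->usage -= freed;
		cg->reclaimable -= freed;
	}
	if (cg->usage + pages > cg->limit)
		return false;
	cg->usage += pages;
	return true;
}

/*
 * Try a high order slab page first, then fall back to min_order.
 * Returns the order that was charged, or -1 on failure.
 *
 * high_order_may_reclaim == false: the opportunistic attempt is charged
 * without the wait bit, so it never reclaims.
 * high_order_may_reclaim == true: the opportunistic attempt is charged
 * with the caller's flags and may reclaim.
 */
static int toy_alloc_slab(struct toy_cgroup *cg, unsigned int gfp,
			  unsigned int high_order, unsigned int min_order,
			  bool high_order_may_reclaim)
{
	unsigned int try_gfp = high_order_may_reclaim ?
			       gfp : (gfp & ~TOY_GFP_WAIT);

	if (toy_charge(cg, try_gfp, (size_t)1 << high_order))
		return (int)high_order;
	if (toy_charge(cg, gfp, (size_t)1 << min_order))
		return (int)min_order;
	return -1;
}
```

With usage 60, limit 64 and an order-3 request, the first variant ends
up with an order-0 page and no reclaim, while the second reclaims four
pages and gets the order-3 page. The model charges reclaim nothing, which
is precisely the simplification under dispute here: how much that reclaim
costs is what decides which variant is preferable.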

> > be ignoring this with your patch AFAIU and break the optimization. There
> > are other cases like that. E.g. THP pages are allocated without GFP_WAIT
> > when defrag is disabled.
>
> It might be wrong. If we can't find a contiguous 2Mb page, we should
> probably give up instead of calling compactor. For memcg it might be
> better to reclaim some space for 2Mb page right now and map a 2Mb page
> instead of reclaiming space for 512 4Kb pages a moment later, because in
> memcg case there is absolutely no difference between reclaiming 2Mb for
> a huge page and 2Mb for 512 4Kb pages.

Or maybe the whole reclaim just doesn't pay off because the TLB savings
will never compensate for the cost of the reclaim. The defrag knob
basically says that we shouldn't opportunistically try to make room for
a THP page.

> > > That being said, this is the fix at the right layer.
> > >
> > > > Either we should start failing GFP_NOWAIT charges when we are above
> > > > high wmark or deploy an additional catchup mechanism as suggested by
> > > > Tejun.
> > >
> > > The mechanism proposed by Tejun won't help us to avoid allocation
> > > failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS.
> >
> > Why would that be a problem? The _hard_ limit is reached and reclaim
> > cannot make any progress. An allocation failure is to be expected.
> > GFP_NOWAIT will fail normally and GFP_NOFS will attempt to reclaim
> > before failing.
>
> Quoting my e-mail to Tejun explaining why using task_work won't help if
> we don't fix SLAB/SLUB:
>
> : Generally speaking, handing over reclaim responsibility to task_work
> : won't help, because there might be cases when a process spends quite a
> : lot of time in kernel invoking lots of GFP_KERNEL allocations before
> : returning to userspace. Without fixing slab/slub, such a process will
> : charge w/o __GFP_WAIT and therefore can exceed memory.high and reach
> : memory.max. If there are no other active processes in the cgroup, the
> : cgroup can stay with memory.high excess for a relatively long time
> : (suppose the process was throttled in kernel), possibly hurting the rest
> : of the system. What is worse, if the process happens to invoke a real
> : GFP_NOWAIT allocation when it's about to hit the limit, it will fail.
>
> For a kmalloc user that's completely unexpected.

We have global reclaim, which handles global memory pressure. And
until the hard limit is enforced I do not see what the huge problem is
here. Sure, the high limit can be exceeded, but that is to be expected,
same as failing allocations when the hard limit is enforced.

Maybe moving the whole high limit reclaim to a delayed context is not
what we will end up with, and we will restrict this to GFP_NOWAIT or
other weak reclaim contexts only. This is to be discussed, of course.

> > > To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when
> > > the gap between limit and usage is getting too small. It may be done
> > > from a workqueue or from task_work, but currently I don't see any reason
> > > why complicate and not just start reclaim directly, just like
> > > memory.high does.
> >
> > Yes we can do better than we do right now. But that doesn't mean we
> > should put hacks all over the place and lie about the allocation
> > context.
>
> What do you mean by saying "all over the place"? It's a fix for kmem
> implementation, to be more exact for the part of it residing in the slab
> core.

I meant in two slab allocators currently, because of implementation
details which are spread across three different places: the page
allocator, the memcg charging code and the respective slab allocator
specifics.

> Everyone else, except a couple of kmem users issuing alloc_page
> directly like threadinfo, will use kmalloc and know nothing what's going
> on there and how all this accounting stuff is handled - they will just
> use plain old convenient kmalloc, which works exactly as it does in the
> root cgroup.

If we ever grow more users and charge more kernel memory, then they might
make similar assumptions and tweak the allocation/charge context, and we
would end up in a bigger mess. It makes much more sense to keep the
allocation and charge contexts consistent.

--
Michal Hocko
SUSE Labs