Re: [BUG] trigger BUG_ON in mas_store_prealloc when low memory

From: John Hsu (許永翰)
Date: Mon Jul 10 2023 - 08:50:10 EST


On Thu, 2023-07-06 at 14:54 -0400, Liam R. Howlett wrote:
>
> Apologies for the late response.
>
> * John Hsu (許永翰) <John.Hsu@xxxxxxxxxxxx> [230616 05:19]:
> > On Wed, 2023-06-14 at 11:58 -0400, Liam R. Howlett wrote:
> > >
> > > * John Hsu (許永翰) <John.Hsu@xxxxxxxxxxxx> [230614 03:06]:
> > > > Hi Liam, thanks for your reply.
> > >
> > > Sorry, your email response with top posting is hard to follow so I
> > > will do my best to answer your questions.
> >
> > Sorry for the wrong format....
> >
> > > >
> > > >
> > > >
> > > > version 6.1 or 6.1.x? Which exact version (git id or version number)
> > > >
> > > > Our environment is kernel-6.1.25-mainline-android14-5-gdea04bf2c398d.
> > >
> > > Okay, I can have a look at 6.1.25 then.
> >
> > OK, thanks.
> >
> > > >
> > > >
> > > > This BUG_ON() is necessary since this function should _never_ run out of
> > > > memory; this function does not return an error code. mas_preallocate()
> > > > should have gotten you the memory necessary (or returned an -ENOMEM)
> > > > prior to the call to mas_store_prealloc(), so this is probably an
> > > > internal tree problem.
> > > >
> > > > There is a tree operation being performed here. mprotect is merging a
> > > > vma by the looks of the call stack. Why do you think no tree operation
> > > > is necessary?
> > > >
> > > > As you mentioned, mas_preallocate() should allocate enough node,
> > > > but there is such functions mas_node_count() in mas_store_prealloc().
> > > > In mas_node_count() checks whether the *mas* has enough nodes, and
> > > > allocate memory for node if there was no enough nodes in mas.
> > >
> > > Right, we call mas_node_count() so that both code paths are used for
> > > preallocations and regular mas_store()/mas_store_gfp(). It shouldn't
> > > take a significant amount of time to verify there is enough nodes.
> >
> > Yap..., it didn't take a significant amount of time to verify whether
> > there is enough nodes. The problem is why the flow in mas_node_count
> > will alloc nodes if there was no enough nodes in mas?
>
> What I meant is that both methods use the same call path because there
> is not a reason to duplicate the path. After mas_preallocate() has
> allocated the nodes needed, the call to check if there is enough nodes
> will be quick.

So is the purpose of mas_preallocate() to reduce the lock hold time?
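
To make the question concrete, here is a minimal userspace sketch of the
pattern as I understand it: reserve memory where failure can still be
returned, then do a store that cannot fail. Every name below (struct
reserve, reserve_nodes(), store_reserved(), tree_insert(), the lock_held
flag) is hypothetical and only stands in for the idea; none of it is the
maple tree API.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_RESERVED 4
#define TREE_SLOTS   16

/* Hypothetical stand-in for a preallocation context; not struct ma_state. */
struct reserve {
	void *nodes[MAX_RESERVED];
	int count;
};

static int lock_held;          /* toy lock: 1 while "held" */
static void *tree[TREE_SLOTS]; /* toy tree: a flat array of nodes */
static int tree_used;

/* Reserve nodes before taking the lock; this step is allowed to fail. */
static int reserve_nodes(struct reserve *r, int needed)
{
	assert(!lock_held);
	r->count = 0;
	if (needed > MAX_RESERVED || tree_used + needed > TREE_SLOTS)
		return -1;
	while (r->count < needed) {
		void *n = malloc(64);
		if (!n) {
			/* Undo the partial reserve and report the failure. */
			while (r->count > 0)
				free(r->nodes[--r->count]);
			return -1;
		}
		r->nodes[r->count++] = n;
	}
	return 0;
}

/* Consume the reserve under the lock; no allocation, no failure path. */
static void store_reserved(struct reserve *r)
{
	assert(lock_held);
	while (r->count > 0)
		tree[tree_used++] = r->nodes[--r->count];
}

static int tree_insert(int needed)
{
	struct reserve r;

	if (reserve_nodes(&r, needed))
		return -1;      /* the caller can still back out here */
	lock_held = 1;
	store_reserved(&r);     /* point of no return */
	lock_held = 0;
	return 0;
}
```

In this toy version the only place an error can be returned is before the
lock is taken, which is the property I am asking about.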

> >
> > > > I think that if mas_preallocate() allocate enough node, why we
> > > > check the node count and allocate nodes if there was no enough nodes
> > > > in mas in mas_node_count()?
> > >
> > > We check for the above reason.
> > >
> >
> > OK..., this is one of the root cause of this BUG.
>
> The root cause is that there was not enough memory for a store
> operation. Regardless of if we check the allocations in the
> mas_store_prealloc() path or not, this would fail. If we remove the
> check for nodes within this path, then we would have to BUG_ON() when we
> run out of nodes to use or have a null pointer dereference BUG anyways.
>
Yes, the root cause is OOM. The BUG_ON() is necessary for situations
where the maple tree structure cannot be maintained because of a lack of
memory. But the buddy system in the Linux kernel can reclaim memory when
the system is low on memory. If we retry with GFP_KERNEL after the
GFP_NOWAIT attempt fails to allocate a node, the second attempt may be
able to get enough memory.
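
A minimal userspace sketch of the fallback I have in mind is below. It is
a toy model, not kernel code: toy_alloc(), alloc_node_with_retry(), the
two modes and the byte counters are all hypothetical names that only
imitate the difference between an allocation that must not wait and one
that may wait for reclaim.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for GFP_NOWAIT / GFP_KERNEL; not kernel flags. */
enum alloc_mode { MODE_NOWAIT, MODE_MAY_RECLAIM };

static size_t free_bytes;        /* memory available right now */
static size_t reclaimable_bytes; /* memory that reclaim could free */

/* Toy allocator: only MODE_MAY_RECLAIM may turn reclaimable into free. */
static void *toy_alloc(size_t size, enum alloc_mode mode)
{
	if (free_bytes < size && mode == MODE_MAY_RECLAIM) {
		free_bytes += reclaimable_bytes;
		reclaimable_bytes = 0;
	}
	if (free_bytes < size)
		return NULL;
	free_bytes -= size;
	return malloc(size);
}

/* Try the non-waiting attempt first, then retry once with reclaim. */
static void *alloc_node_with_retry(size_t size)
{
	void *node = toy_alloc(size, MODE_NOWAIT);

	if (!node)
		node = toy_alloc(size, MODE_MAY_RECLAIM);
	return node; /* NULL only if reclaim did not help either */
}
```

The second attempt is only valid in a context that is allowed to sleep,
so this sketch assumes the caller can wait at that point.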
> >
> > > >
> > > > We have seen that there may be some maple_tree operations in
> > > > merge_vma...
> > >
> > > If merge_vma() does anything, then there was an operation to the maple
> > > tree.
> > >
> > > >
> > > > Moreover, would maple_tree provides an API for assigning user's gfp
> > > > flag for allocating node?
> > >
> > > mas_preallocate() and mas_store_gfp() has gfp flags as an argument. In
> > > your call stack, it will be called in __vma_adjust() as such:
> > >
> > > 	if (mas_preallocate(&mas, vma, GFP_KERNEL))
> > > 		return -ENOMEM;
> > >
> > > line 715 in v6.1.25
> > >
> > > > In rb_tree, we allocate vma_area_struct (rb_node is in this struct.)
> > > > with GFP_KERNEL, and maple_tree allocate node with GFP_NOWAIT and
> > > > __GFP_NOWARN.
> > >
> > > We use GFP_KERNEL as I explained above for the VMA tree.
> >
> > Got it! But the mas_node_count() always use GFP_NOWAIT and __GFP_NOWARN
> > in inserting tree flow. Do you consider the performance of maintaining
> > the structure of maple_tree?
>
> Sorry, I don't understand what you mean by 'consider the performance of
> maintaining the structure of maple_tree'.
>
As I mentioned above, GFP_NOWAIT does not allow the buddy system to do
direct reclaim. What I meant by "Do you consider the performance of
maintaining the structure of maple_tree" is: is the mas_node_count()
path prevented from reclaiming or compacting memory for performance
reasons?
> >
> > > It also will drop the lock and retry with GFP_KERNEL on failure
> > > when not using the external lock. The mmap_lock is configured as an
> > > external lock.
> > >
> > > > Allocation will not wait for reclaiming and compacting when there
> > > > is no enough available memory.
> > > > Is there any concern for this design?
> > >
> > > This has been addressed above, but let me know if I missed anything
> > > here.
> > >
> >
> > I think that the mas_node_count() has higher rate of triggering
> > BUG_ON() when allocating nodes with GFP_NOWAIT and __GFP_NOWARN. If
> > mas_node_count() use GFP_KERNEL as mas_preallocate() in the mmap.c, the
> > allocation fail rate may be lower than use GFP_NOWAIT.
>
> Which BUG_ON() are you referring to?
>
> If I was to separate the code path for mas_store_prealloc() and
> mas_store_gfp(), then a BUG_ON() would still need to exist and still
> would have been triggered.. We are in a place in the code where we
> should never sleep and we don't have enough memory allocated to do what
> was necessary.
>
Yes. There is no reason to separate mas_store_prealloc() and
mas_store_gfp(). Is it possible to retry the mas_node allocation with
GFP_KERNEL (waiting for the system to reclaim and compact) instead of
triggering BUG_ON() once the GFP_NOWAIT allocation has failed?

> Thanks,
> Liam

Best Regards,
John Hsu