Re: [BUG] trigger BUG_ON in mas_store_prealloc when low memory

From: Liam R. Howlett
Date: Mon Jul 10 2023 - 10:24:53 EST


* John Hsu (許永翰) <John.Hsu@xxxxxxxxxxxx> [230710 08:50]:
...

> > > > > This BUG_ON() is necessary since this function should _never_
> > run
> > > > out of
> > > > >
> > > > > memory; this function does not return an error code.
> > > > mas_preallocate()
> > > > >
> > > > > should have gotten you the memory necessary (or returned an
> > > > -ENOMEM)
> > > > >
> > > > > prior to the call to mas_store_prealloc(), so this is probably
> > an
> > > > >
> > > > > internal tree problem.
> > > > >
> > > > > There is a tree operation being performed here. mprotect is
> > > > merging a
> > > > >
> > > > > vma by the looks of the call stack. Why do you think no tree
> > > > operation
> > > > >
> > > > > is necessary?
> > > > >
> > > > > As you mentioned, mas_preallocate() should allocate enough
> > node,
> > > > but there is such functions mas_node_count() in
> > mas_store_prealloc().
> > > > > In mas_node_count() checks whether the *mas* has enough nodes,
> > and
> > > > allocate memory for node if there was no enough nodes in mas.
> > > >
> > > > Right, we call mas_node_count() so that both code paths are used
> > for
> > > > preallocations and regular mas_store()/mas_store_gfp(). It
> > shouldn't
> > > > take a significant amount of time to verify there is enough
> > nodes.
> > >
> > > Yap..., it didn't take a significant amount of time to verify
> > whether
> > > there is enough nodes. The problem is why the flow in
> > mas_node_count
> > > will alloc nodes if there was no enough nodes in mas?
> >
> > What I meant is that both methods use the same call path because
> > there
> > is not a reason to duplicate the path. After mas_preallocate() has
> > allocated the nodes needed, the call to check if there is enough
> > nodes
> > will be quick.
>
> So whether the purpose of mas_preallocate() is decreasing the lock
> retention time?

It could be, but in this case it's the locking order. We have to
pre-allocate and fail early if we are out of memory, because we _cannot_
use GFP_KERNEL where we call mas_store_prealloc(). mas_preallocate()
will use GFP_KERENL though. We cannot use GFP_KERNEL during the store
because reclaim may sleep and we cannot sleep holding the locks we need
to hold at the time of the store operation in __vma_adjust().

>
> > >
> > > > > I think that if mas_preallocate() allocate enough node, why we
> > > > check the node count and allocate nodes if there was no enough
> > nodes
> > > > in mas in mas_node_count()?
> > > >
> > > > We check for the above reason.
> > > >
> > >
> > > OK..., this is one of the root cause of this BUG.
> >
> > The root cause is that there was not enough memory for a store
> > operation. Regardless of if we check the allocations in the
> > mas_store_prealloc() path or not, this would fail. If we remove the
> > check for nodes within this path, then we would have to BUG_ON() when
> > we
> > run out of nodes to use or have a null pointer dereference BUG
> > anyways.
> >
> Yap, the root cause is oom. The BUG_ON() for the situations that the
> maple tree struct cannot be maintained because of the lack of memory is
> necessary. But the the buddy system in linux kernel can reclaim memory
> when the system is under the low memory status. If we use GFP_KERNEL
> after trying GFP_NOWAIT to allocate node, maybe we can get enough
> memory when the second try with GFP_KERNEL.

Right, but the GFP_KERNEL cannot be used when holding certain locks
because it may sleep.

> > >
> > > > >
> > > > > We have seen that there may be some maple_tree operations in
> > > > merge_vma...
> > > >
> > > > If merge_vma() does anything, then there was an operation to the
> > > > maple
> > > > tree.
> > > >
> > > > >
> > > > > Moreover, would maple_tree provides an API for assigning user's
> > gfp
> > > > flag for allocating node?
> > > >
> > > > mas_preallocate() and mas_store_gfp() has gfp flags as an
> > > > argument. In
> > > > your call stack, it will be called in __vma_adjust() as such:
> > > >
> > > > if (mas_preallocate(&mas, vma, GFP_KERNEL))
> > > > return -ENOMEM;
> > > >
> > > > line 715 in v6.1.25
> > > >
> > > > > In rb_tree, we allocate vma_area_struct (rb_node is in this
> > > > struct.) with GFP_KERNEL, and maple_tree allocate node with
> > > > GFP_NOWAIT and __GFP_NOWARN.
> > > >
> > > > We use GFP_KERNEL as I explained above for the VMA tree.
> > >
> > > Got it! But the mas_node_count() always use GFP_NOWAIT and
> > __GFP_NOWARN
> > > in inserting tree flow. Do you consider the performance of
> > maintaining
> > > the structure of maple_tree?
> >
> > Sorry, I don't understand what you mean by 'consider the performance
> > of
> > maintaining the structure of maple_tree'.
> >
> As I mentioned above, GFP_NOWAIT will not allow buddy system for
> reclaiming memory, so "Do you consider the performance of maintaining
> the structure of maple_tree" means that: whether the mas_node_count()
> path is not allowed to reclaim or compact memory for the performance.

Ah, no. This is not for performance. It was initially on the road map
for performance, but it was needed for the complicated locking in the MM
code.

rb tree embedded the nodes in the VMA which is allocated outside this
critical section and so it could use GFP_KERNEL.

> > >
> > > > It also will drop the lock and retry with GFP_KERNEL on failure
> > > > when not using the external lock. The mmap_lock is configured as
> > an
> > > > external lock.
> > > >
> > > > > Allocation will not wait for reclaiming and compacting when
> > there
> > > > is no enough available memory.
> > > > > Is there any concern for this design?
> > > >
> > > > This has been addressed above, but let me know if I missed
> > anything
> > > > here.
> > > >
> > >
> > > I think that the mas_node_count() has higher rate of triggering
> > > BUG_ON() when allocating nodes with GFP_NOWAIT and __GFP_NOWARN. If
> > > mas_node_count() use GFP_KERNEL as mas_preallocate() in the mmap.c,
> > the
> > > allocation fail rate may be lower than use GFP_NOWAIT.
> >
> > Which BUG_ON() are you referring to?
> >
> > If I was to separate the code path for mas_store_prealloc() and
> > mas_store_gfp(), then a BUG_ON() would still need to exist and still
> > would have been triggered.. We are in a place in the code where we
> > should never sleep and we don't have enough memory allocated to do
> > what
> > was necessary.
> >
> Yap. There is no reason to seprate mas_store_prealloc() and
> mas_store_gfp. Is it possible to retry to allocate mas_node with
> GFP_KERNEL (wait for system reclaim and compact) instead of triggering
> BUG_ON once the GFP_NOWAIT allocation failed?

Unfortunately not, no. In some cases it may actually work out that the
VMA may not need the locks in question, but it cannot be generalized for
__vma_adjust(). Where I am able, I use the mas_store_gfp() calls so we
can let reclaim and compact run, but it's not possible here.

Thanks,
Liam