Re: [RFC v1 4/5] maple_tree: avoid bulk alloc/free to use percpu array more

From: Vlastimil Babka
Date: Tue Aug 08 2023 - 13:11:40 EST


On 8/8/23 16:29, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230808 07:17]:
>>
>>
>> 在 2023/8/8 17:53, Vlastimil Babka 写道:
>> > Using bulk alloc/free on a cache with percpu array should not be
>> > necessary and the bulk alloc actually bypasses the array (the prefill
>> > functionality currently relies on this).
>> >
>> > The simplest change is just to convert the respective maple tree
>> > wrappers to do a loop of normal alloc/free.
>> > ---
>> > lib/maple_tree.c | 11 +++++++++--
>> > 1 file changed, 9 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> > index 1196d0a17f03..7a8e7c467d7c 100644
>> > --- a/lib/maple_tree.c
>> > +++ b/lib/maple_tree.c
>> > @@ -161,12 +161,19 @@ static inline struct maple_node *mt_alloc_one(gfp_t gfp)
>> > static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
>> > {
>> > - return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
>> > + int allocated = 0;
>> > + for (size_t i = 0; i < size; i++) {
>> > + nodes[i] = kmem_cache_alloc(maple_node_cache, gfp);
>> > + if (nodes[i])
>> If the i-th allocation fails, node[i] will be NULL. This is wrong. We'd
>> better guarantee that mt_alloc_bulk() allocates completely successfully,
>> or returns 0. The following cases are not allowed:
>> nodes: [addr1][addr2][NULL][addr3].

Thanks, indeed. I guess it should just break; in case of failure and return
how many allocations succeeded so far.

But note this is a really a quick RFC proof of concept hack. I'd expect if
the whole idea is deemed as good, the maple tree node handling could be
redesigned (simplified?) around it and maybe there's no mt_alloc_bulk()
anymore as a result?

> Thanks for pointing this out Peng.
>
> We can handle a lower number than requested being returned, but we
> cannot handle the sparse data.
>
> The kmem_cache_alloc_bulk() can return a failure today - leaving the
> array to be cleaned by the caller, so if this is changed to a full
> success or full fail, then we will also have to change the caller to
> handle whatever state is returned if it differs from
> kmem_cache_alloc_bulk().
>
> It might be best to return the size already allocated when a failure is
> encountered. This will make the caller, mas_alloc_nodes(), request more
> nodes. Only in the case of zero allocations would this be seen as an
> OOM event.
>
> Vlastimil, Is the first kmem_cache_alloc() call failing a possibility?

Sure, if there's no memory, it can fail. In practice if gfp is one that
allows reclaim, it will ultimately be the "too small to fail" allocation on
the page allocator level. But there are exceptions, like having received a
fatal signal, IIRC :)

> If so, what should be the corrective action?

Depends on your context, if you can pass on -ENOMEM to the caller, or need
to succeed.

>> > + allocated++;
>> > + }
>> > + return allocated;
>> > }
>> > static inline void mt_free_bulk(size_t size, void __rcu **nodes)
>> > {
>> > - kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
>> > + for (size_t i = 0; i < size; i++)
>> > + kmem_cache_free(maple_node_cache, nodes[i]);
>> > }
>> > static void mt_free_rcu(struct rcu_head *head)