Re: [PATCH 00/14] Reduce preallocations for maple tree

From: Yin Fengwei
Date: Mon Jun 05 2023 - 21:34:09 EST




On 6/5/23 22:03, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230605 03:59]:
>>
>>
>> 在 2023/6/5 14:18, Yin, Fengwei 写道:
>>>
>>>
>>> On 6/5/2023 12:41 PM, Yin Fengwei wrote:
>>>> Hi Peng,
>>>>
>>>> On 6/5/23 11:28, Peng Zhang wrote:
>>>>>
>>>>>
>>>>> 在 2023/6/2 16:10, Yin, Fengwei 写道:
>>>>>> Hi Liam,
>>>>>>
>>>>>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>>>>>>> Initial work on preallocations showed no regression in performance
>>>>>>> during testing, but recently some users (both on [1] and off [android]
>>>>>>> list) have reported that preallocating the worst-case number of nodes
>>>>>>> has caused some slow down.  This patch set addresses the number of
>>>>>>> allocations in a few ways.
>>>>>>>
>>>>>>> During munmap() most munmap() operations will remove a single VMA, so
>>>>>>> leverage the fact that the maple tree can place a single pointer at
>>>>>>> range 0 - 0 without allocating.  This is done by changing the index in
>>>>>>> the 'sidetree'.
>>>>>>>
>>>>>>> Re-introduce the entry argument to mas_preallocate() so that a more
>>>>>>> intelligent guess of the node count can be made.
>>>>>>>
>>>>>>> Patches are in the following order:
>>>>>>> 0001-0002: Testing framework for benchmarking some operations
>>>>>>> 0003-0004: Reduction of maple node allocation in sidetree
>>>>>>> 0005:      Small cleanup of do_vmi_align_munmap()
>>>>>>> 0006-0013: mas_preallocate() calculation change
>>>>>>> 0014:      Change the vma iterator order
>>>>>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
>>>>>> this patchset.
>>>>>>
>>>>>> The result has a little bit improvement:
>>>>>> Base (next-20230602):
>>>>>>    503880
>>>>>> Base with this patchset:
>>>>>>    519501
>>>>>>
>>>>>> But they are far from the none-regression result (commit 7be1c1a3c7b1):
>>>>>>    718080
>>>>>>
>>>>>>
>>>>>> Some other information I collected:
>>>>>> With Base, the mas_alloc_nodes are always hit with request: 7.
>>>>>> With this patchset, the request are 1 or 5.
>>>>>>
>>>>>> I suppose this is the reason for improvement from 503880 to 519501.
>>>>>>
>>>>>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
>>>>>> mas_alloc_nodes() call. Thanks.
>>>>> Hi Fengwei,
>>>>>
>>>>> I think it may be related to the inaccurate number of nodes allocated
>>>>> in the pre-allocation. I slightly modified the pre-allocation in this
>>>>> patchset, but I don't know if it works. It would be great if you could
>>>>> help test it, and help pinpoint the cause. Below is the diff, which can
>>>>> be applied based on this pachset.
>>>> I tried the patch, it could eliminate the call of mas_alloc_nodes() during
>>>> the test. But the result of benchmark got a little bit improvement:
>>>> 529040
>>>>
>>>> But it's still much less than none-regression result. I will also double
>>>> confirm the none-regression result.
>>> Just noticed that the commit f5715584af95 make validate_mm() two implementation
>>> based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have
>>> CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an
>>> apple to apple.
>
> You mean "mm: update validate_mm() to use vma iterator" here I guess. I
> have it as a different commit id in my branch.
Yes. The f5715584af95 was from next-20230602. I will include the subject
of the patch in the future to avoid such confusion.

>
> I 'restored' some of the checking because I was able to work around not
> having the mt_dump() definition with the vma iterator. I'm now
> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
> have added these extra checks.
No worries here. The problem is in my local config. I enabled the
CONFIG_DEBUG_VM to capture VM asserts/warnings. It should be disabled
for performance testing. LKP does disable it for performance check.


Regards
Yin, Fengwei

>
>>>
>>>
>>> I disable CONFIG_DEBUG_VM and re-run the test and got:
>>> Before preallocation change (7be1c1a3c7b1):
>>> 770100
>>> After preallocation change (28c5609fb236):
>>> 680000
>>> With liam's fix:
>>> 702100
>>> plus Peng's fix:
>>> 725900
>> Thank you for your test, now it seems that the performance
>> regression is not so much.
>
> We are also too strict on the reset during mas_store_prealloc() checking
> for a spanning write. I have a fix for this for v2 of the patch set,
> although I suspect it will not make a huge difference.
>
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>> Thanks,
>>>>> Peng
>>>>>
>>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>>> index 5ea211c3f186..e67bf2744384 100644
>>>>> --- a/lib/maple_tree.c
>>>>> +++ b/lib/maple_tree.c
>>>>> @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>>          goto ask_now;
>>>>>      }
>>>>>
>>>>> -    /* New root needs a singe node */
>>>>> -    if (unlikely(mte_is_root(mas->node)))
>>>>> -        goto ask_now;
>
> Why did you drop this? If we are creating a new root we will only need
> one node.
>
>>>>> +    if ((node_size == wr_mas.node_end + 1 &&
>>>>> +         mas->offset == wr_mas.node_end) ||
>>>>> +        (node_size == wr_mas.node_end &&
>>>>> +         wr_mas.offset_end - mas->offset == 1))
>>>>> +        return 0;
>
> I will add this to v2 as well, or something similar.
>
>>>>>
>>>>>      /* Potential spanning rebalance collapsing a node, use worst-case */
>>>>>      if (node_size  - 1 <= mt_min_slots[wr_mas.type])
>>>>> @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>>      if (likely(!mas_is_err(mas)))
>>>>>          return 0;
>>>>>
>>>>> -    mas_set_alloc_req(mas, 0);
>
> Why did you drop this? It seems like a worth while cleanup on failure.
>
>>>>>      ret = xa_err(mas->node);
>>>>>      mas_reset(mas);
>>>>>      mas_destroy(mas);
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Yin, Fengwei
>>>>>>
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/linux-mm/202305061457.ac15990c-yujie.liu@xxxxxxxxx/
>>>>>>>
>>>>>>> Liam R. Howlett (14):
>>>>>>>    maple_tree: Add benchmarking for mas_for_each
>>>>>>>    maple_tree: Add benchmarking for mas_prev()
>>>>>>>    mm: Move unmap_vmas() declaration to internal header
>>>>>>>    mm: Change do_vmi_align_munmap() side tree index
>>>>>>>    mm: Remove prev check from do_vmi_align_munmap()
>>>>>>>    maple_tree: Introduce __mas_set_range()
>>>>>>>    mm: Remove re-walk from mmap_region()
>>>>>>>    maple_tree: Re-introduce entry to mas_preallocate() arguments
>>>>>>>    mm: Use vma_iter_clear_gfp() in nommu
>>>>>>>    mm: Set up vma iterator for vma_iter_prealloc() calls
>>>>>>>    maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
>>>>>>>    maple_tree: Update mas_preallocate() testing
>>>>>>>    maple_tree: Refine mas_preallocate() node calculations
>>>>>>>    mm/mmap: Change vma iteration order in do_vmi_align_munmap()
>>>>>>>
>>>>>>>   fs/exec.c                        |   1 +
>>>>>>>   include/linux/maple_tree.h       |  23 ++++-
>>>>>>>   include/linux/mm.h               |   4 -
>>>>>>>   lib/maple_tree.c                 |  78 ++++++++++----
>>>>>>>   lib/test_maple_tree.c            |  74 +++++++++++++
>>>>>>>   mm/internal.h                    |  40 ++++++--
>>>>>>>   mm/memory.c                      |  16 ++-
>>>>>>>   mm/mmap.c                        | 171 ++++++++++++++++---------------
>>>>>>>   mm/nommu.c                       |  45 ++++----
>>>>>>>   tools/testing/radix-tree/maple.c |  59 ++++++-----
>>>>>>>   10 files changed, 331 insertions(+), 180 deletions(-)
>>>>>>>