Re: [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()

From: Liam R. Howlett
Date: Wed Jul 26 2023 - 13:07:41 EST


* Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230726 04:10]:
> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
> directly modify the entries of VMAs in the new maple tree, which can
> get better performance. dup_mmap() is used by fork(), so this patch
> optimizes fork(). The optimization effect is proportional to the number
> of VMAs.
>
> Due to the introduction of this method, the optimization in
> (maple_tree: add a fast path case in mas_wr_slot_store())[1] no longer
> has an effect here, but it is also an optimization of the maple tree.
>
> There is a unixbench test suite[2] where 'spawn' is used to test fork().
> 'spawn' only has 23 VMAs by default, so I tweaked the benchmark code a
> bit to use mmap() to control the number of VMAs. Therefore, the
> performance under different numbers of VMAs can be measured.
>
> Insert code like below into 'spawn':
> for (int i = 0; i < 200; ++i) {
> size_t size = 10 * getpagesize();
> void *addr;
>
> if (i & 1) {
> addr = mmap(NULL, size, PROT_READ,
> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> } else {
> addr = mmap(NULL, size, PROT_WRITE,
> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> }
> if (addr == MAP_FAILED)
> ...
> }
>
> Based on next-20230721, use 'spawn' under 23, 203, and 4023 VMAs, test
> 4 times in 30 seconds each time, and get the following numbers. These
> numbers are the number of fork() successes in 30s (average of the best
> 3 out of 4). By the way, based on next-20230725, I reverted [1], and
> tested it together as a comparison. In order to ensure the reliability
> of the test results, these tests were run on a physical machine.
>
> 23VMAs 223VMAs 4023VMAs
> revert [1]: 159104.00 73316.33 6787.00

You can probably remove the revert benchmark from this since there is no
reason to revert the previous change. The change is worth while on its
own, so it's better to have the numbers more clear by having with and
without this series.

>
> +0.77% +0.42% +0.28%
> next-20230721: 160321.67 73624.67 6806.33
>
> +2.77% +15.42% +29.86%
> apply this: 164751.67 84980.33 8838.67

What is the difference between using this patch with mas_replace_entry()
and mas_store_entry()?

>
> It can be seen that the performance improvement is proportional to
> the number of VMAs. With 23 VMAs, performance improves by about 3%,
> with 223 VMAs, performance improves by about 15%, and with 4023 VMAs,
> performance improves by about 30%.
>
> [1] https://lore.kernel.org/lkml/20230628073657.75314-4-zhangpeng.00@xxxxxxxxxxxxx/
> [2] https://github.com/kdlucas/byte-unixbench/tree/master
>
> Signed-off-by: Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx>
> ---
> kernel/fork.c | 35 +++++++++++++++++++++++++++--------
> mm/mmap.c | 14 ++++++++++++--
> 2 files changed, 39 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f81149739eb9..ef80025b62d6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> int retval;
> unsigned long charge = 0;
> LIST_HEAD(uf);
> - VMA_ITERATOR(old_vmi, oldmm, 0);
> VMA_ITERATOR(vmi, mm, 0);
>
> uprobe_start_dup_mmap();
> @@ -678,17 +677,40 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> goto out;
> khugepaged_fork(mm, oldmm);
>
> - retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
> - if (retval)
> + /* Use __mt_dup() to efficiently build an identical maple tree. */
> + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
> + if (unlikely(retval))
> goto out;
>
> mt_clear_in_rcu(vmi.mas.tree);
> - for_each_vma(old_vmi, mpnt) {
> + for_each_vma(vmi, mpnt) {
> struct file *file;
>
> vma_start_write(mpnt);
> if (mpnt->vm_flags & VM_DONTCOPY) {
> vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> +
> + /*
> + * Since the new tree is exactly the same as the old one,
> + * we need to remove the unneeded VMAs.
> + */
> + mas_store(&vmi.mas, NULL);
> +
> + /*
> + * Even removing an entry may require memory allocation,
> + * and if removal fails, we use XA_ZERO_ENTRY to mark
> + * from which VMA it failed. The case of encountering
> + * XA_ZERO_ENTRY will be handled in exit_mmap().
> + */
> + if (unlikely(mas_is_err(&vmi.mas))) {
> + retval = xa_err(vmi.mas.node);
> + mas_reset(&vmi.mas);
> + if (mas_find(&vmi.mas, ULONG_MAX))
> + mas_replace_entry(&vmi.mas,
> + XA_ZERO_ENTRY);
> + goto loop_out;
> + }
> +
> continue;
> }
> charge = 0;
> @@ -750,8 +772,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> hugetlb_dup_vma_private(tmp);
>
> /* Link the vma into the MT */
> - if (vma_iter_bulk_store(&vmi, tmp))
> - goto fail_nomem_vmi_store;
> + mas_replace_entry(&vmi.mas, tmp);
>
> mm->map_count++;
> if (!(tmp->vm_flags & VM_WIPEONFORK))
> @@ -778,8 +799,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> uprobe_end_dup_mmap();
> return retval;
>
> -fail_nomem_vmi_store:
> - unlink_anon_vmas(tmp);
> fail_nomem_anon_vma_fork:
> mpol_put(vma_policy(tmp));
> fail_nomem_policy:
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bc91d91261ab..5bfba2fb0e39 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3184,7 +3184,11 @@ void exit_mmap(struct mm_struct *mm)
> arch_exit_mmap(mm);
>
> vma = mas_find(&mas, ULONG_MAX);
> - if (!vma) {
> + /*
> + * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
> + * xa_is_zero(vma) may be true.
> + */
> + if (!vma || xa_is_zero(vma)) {
> /* Can happen if dup_mmap() received an OOM */
> mmap_read_unlock(mm);
> return;
> @@ -3222,7 +3226,13 @@ void exit_mmap(struct mm_struct *mm)
> remove_vma(vma, true);
> count++;
> cond_resched();
> - } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> + vma = mas_find(&mas, ULONG_MAX);
> + /*
> + * If xa_is_zero(vma) is true, it means that subsequent VMAs
> + * donot need to be removed. Can happen if dup_mmap() fails to
> + * remove a VMA marked VM_DONTCOPY.
> + */
> + } while (vma != NULL && !xa_is_zero(vma));
>
> BUG_ON(count != mm->map_count);
>
> --
> 2.20.1
>