Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3

From: Uladzislau Rezki
Date: Thu Feb 22 2024 - 03:35:55 EST


Hello, folks!

> This is v3. It is based on 6.7.0-rc8.
>
> 1. Motivation
>
> - Offload the global vmap locks, making the code scale with the number of CPUs;
> - If possible, and if there is an agreement, remove the "per-cpu kva allocator"
> to make the vmap code simpler;
> - There were complaints from XFS folks that vmalloc can be contended
> on their workloads.
>
> 2. Design(high level overview)
>
> We introduce a vmap node logic. A node behaves as an independent entity
> that serves an allocation request directly (if possible) from its pool.
> That way it bypasses the global vmap space, which is protected by its own lock.
>
> Access to the pools is serialized per CPU. The number of nodes equals the
> number of CPUs in the system, with an upper bound of 128 nodes.
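>
> As a simplified illustration (the structure and field names below are
> illustrative, not necessarily the exact ones used in the series), a node
> bundles its pools and its "busy"/"lazy" trees behind its own locks:
>
> struct rb_list {
>         struct rb_root root;
>         struct list_head head;
>         spinlock_t lock;
> };
>
> struct vmap_pool {
>         struct list_head head;  /* cached VAs of one size class */
>         unsigned long len;      /* number of cached VAs */
> };
>
> struct vmap_node {
>         /* Size-segregated pools of cached VAs, ready for reuse. */
>         struct vmap_pool pool[MAX_VA_SIZE_PAGES];
>         spinlock_t pool_lock;
>
>         /* VAs currently handed out by this node. */
>         struct rb_list busy;
>
>         /* Freed VAs queued for the lazy drain. */
>         struct rb_list lazy;
> };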
>
> Pools are size-segregated and populated based on system demand. The maximum
> alloc request that can be cached in the segregated storage is 256 pages. The
> lazy drain path first decays a pool by 25%, and then repopulates it with
> freshly freed VAs for reuse instead of returning them to the global space.
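>
> One decay step could look roughly like this (a sketch; the helper name
> is made up for illustration):
>
> static void decay_pool(struct vmap_pool *vp)
> {
>         /* Step one: release a quarter of the cached VAs. */
>         unsigned long nr_to_drop = vp->len >> 2;
>
>         while (nr_to_drop--)
>                 drop_one_cached_va(vp); /* made-up helper */
>
>         /* Step two (elsewhere): refill with freshly freed VAs. */
> }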
>
> When a VA is obtained (alloc path), it is stored in a node. The va->va_start
> address is converted to the node where the VA should be placed and reside.
> Doing so balances VAs across the nodes, and as a result access becomes
> scalable. The addr_to_node() function performs this address-to-node conversion.
>
> The vmap space is divided into segments of a fixed size of 16 pages, so that
> any address can be associated with a segment number. The number of nodes is
> equal to num_possible_cpus() but not greater than 128. The numbering starts
> from 0. See below how an address is converted:
>
> static inline unsigned int
> addr_to_node_id(unsigned long addr)
> {
>         return (addr / zone_size) % nr_nodes;
> }
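>
> For example, with 4K pages zone_size is 16 * 4096 = 65536 bytes, so with
> nr_nodes = 64 two addresses 64K apart land in adjacent nodes, and the
> mapping wraps around every 64 * 65536 bytes = 4MB of the vmap space.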
>
> On the free path, a VA can easily be found by converting its "va_start"
> address to the node it resides in. The VA is moved from the "busy" data
> structure to the "lazy" one. Later on, as noted earlier, the lazy kworker
> decays each node pool and repopulates it with freshly freed VAs. Please note,
> a VA is returned to the node that served the alloc request.
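>
> In other words, the free path boils down to something like this (a sketch
> reusing the illustrative structures from above; unlink_va() and
> merge_or_add_vmap_area() are existing vmalloc.c helpers):
>
> static void mark_va_lazy(struct vmap_area *va)
> {
>         struct vmap_node *vn = addr_to_node(va->va_start);
>
>         /* Unlink the VA from the node's "busy" tree... */
>         spin_lock(&vn->busy.lock);
>         unlink_va(va, &vn->busy.root);
>         spin_unlock(&vn->busy.lock);
>
>         /* ...and queue it on the same node's "lazy" tree for the drain. */
>         spin_lock(&vn->lazy.lock);
>         merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
>         spin_unlock(&vn->lazy.lock);
> }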
>
> 3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
>
> sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
>
> <default perf>
>    94.41%  0.89%  [kernel]        [k] _raw_spin_lock
>    93.35% 93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
>    76.13%  0.28%  [kernel]        [k] __vmalloc_node_range
>    72.96%  0.81%  [kernel]        [k] alloc_vmap_area
>    56.94%  0.00%  [kernel]        [k] __get_vm_area_node
>    41.95%  0.00%  [kernel]        [k] vmalloc
>    37.15%  0.01%  [test_vmalloc]  [k] full_fit_alloc_test
>    35.17%  0.00%  [kernel]        [k] ret_from_fork_asm
>    35.17%  0.00%  [kernel]        [k] ret_from_fork
>    35.17%  0.00%  [kernel]        [k] kthread
>    35.08%  0.00%  [test_vmalloc]  [k] test_func
>    34.45%  0.00%  [test_vmalloc]  [k] fix_size_alloc_test
>    28.09%  0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
>    23.53%  0.25%  [kernel]        [k] vfree.part.0
>    21.72%  0.00%  [kernel]        [k] remove_vm_area
>    20.08%  0.21%  [kernel]        [k] find_unlink_vmap_area
>     2.34%  0.61%  [kernel]        [k] free_vmap_area_noflush
> <default perf>
> vs
> <patch-series perf>
>    82.32%  0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
>    63.36%  0.02%  [kernel]        [k] vmalloc
>    63.34%  2.64%  [kernel]        [k] __vmalloc_node_range
>    30.42%  4.46%  [kernel]        [k] vfree.part.0
>    28.98%  2.51%  [kernel]        [k] __alloc_pages_bulk
>    27.28%  0.19%  [kernel]        [k] __get_vm_area_node
>    26.13%  1.50%  [kernel]        [k] alloc_vmap_area
>    21.72% 21.67%  [kernel]        [k] clear_page_rep
>    19.51%  2.43%  [kernel]        [k] _raw_spin_lock
>    16.61% 16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
>    13.40%  2.07%  [kernel]        [k] free_unref_page
>    10.62%  0.01%  [kernel]        [k] remove_vm_area
>     9.02%  8.73%  [kernel]        [k] insert_vmap_area
>     8.94%  0.00%  [kernel]        [k] ret_from_fork_asm
>     8.94%  0.00%  [kernel]        [k] ret_from_fork
>     8.94%  0.00%  [kernel]        [k] kthread
>     8.29%  0.00%  [test_vmalloc]  [k] test_func
>     7.81%  0.05%  [test_vmalloc]  [k] full_fit_alloc_test
>     5.30%  4.73%  [kernel]        [k] purge_vmap_node
>     4.47%  2.65%  [kernel]        [k] free_vmap_area_noflush
> <patch-series perf>
>
> This confirms that native_queued_spin_lock_slowpath drops from 93.07%
> down to 16.51%.
>
> The throughput is ~12x higher:
>
> urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 10m51.271s
> user 0m0.013s
> sys 0m0.187s
> urezki@pc638:~$
>
> urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 0m51.301s
> user 0m0.015s
> sys 0m0.040s
> urezki@pc638:~$
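>
> That is, 10m51s = 651 seconds for the first (default) run versus 51 seconds
> for the second (patched) run: 651 / 51 ~= 12.7, hence the ~12x figure.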
>
> 4. Changelog
>
> v1: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
> v2: https://lore.kernel.org/lkml/20230829081142.3619-1-urezki@xxxxxxxxx/
>
> Delta v2 -> v3:
> - fix comments from the v2 feedback;
> - switch from the pre-fetch chunk logic to less complex size-based pools.
>
> Baoquan He (1):
> mm/vmalloc: remove vmap_area_list
>
> Uladzislau Rezki (Sony) (10):
> mm: vmalloc: Add va_alloc() helper
> mm: vmalloc: Rename adjust_va_to_fit_type() function
> mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
> mm: vmalloc: Remove global vmap_area_root rb-tree
> mm: vmalloc: Remove global purge_vmap_area_root rb-tree
> mm: vmalloc: Offload free_vmap_area_lock lock
> mm: vmalloc: Support multiple nodes in vread_iter
> mm: vmalloc: Support multiple nodes in vmallocinfo
> mm: vmalloc: Set nr_nodes based on CPUs in a system
> mm: vmalloc: Add a shrinker to drain vmap pools
>
>  .../admin-guide/kdump/vmcoreinfo.rst |    8 +-
>  arch/arm64/kernel/crash_core.c       |    1 -
>  arch/riscv/kernel/crash_core.c       |    1 -
>  include/linux/vmalloc.h              |    1 -
>  kernel/crash_core.c                  |    4 +-
>  kernel/kallsyms_selftest.c           |    1 -
>  mm/nommu.c                           |    2 -
>  mm/vmalloc.c                         | 1049 ++++++++++++-----
>  8 files changed, 786 insertions(+), 281 deletions(-)
>
> --
> 2.39.2
>
There is one thing that I have to clarify and which is still open for me.

Test machine:
QEMU x86_64 system
64 CPUs
64G of memory

test suite:
test_vmalloc.sh

environment:
mm-unstable, branch next-20240220, where this series
is located. On top of it I locally added Suren Baghdasaryan's
Memory allocation profiling v3 for a better understanding of
memory usage.

Before running the test, the state is as below:

urezki@pc638:~$ sort -h /proc/allocinfo
   27.2MiB     6970 mm/memory.c:1122 module:memory func:folio_prealloc
   79.1MiB    20245 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    112MiB     8689 mm/slub.c:2202 module:slub func:alloc_slab_page
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172         936       63618           0         134       63236
Swap:              0           0           0
urezki@pc638:~$

The test suite stresses the vmap/vmalloc layer by creating workers that do
alloc/free in a tight loop, i.e. it is considered extreme. Below, three
identical tests were done with only one difference: 64, 128 and 256 kworkers:

1) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64

urezki@pc638:~$ sort -h /proc/allocinfo
   80.1MiB    20518 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
    153MiB    39048 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    178MiB    13259 mm/slub.c:2202 module:slub func:alloc_slab_page
    350MiB    89656 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        1417       63054           0         298       62755
Swap:              0           0           0
urezki@pc638:~$

2) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128

urezki@pc638:~$ sort -h /proc/allocinfo
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
    154MiB    39440 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    196MiB    14038 mm/slub.c:2202 module:slub func:alloc_slab_page
   1.20GiB   315655 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        2556       61914           0         302       61616
Swap:              0           0           0
urezki@pc638:~$

3) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256

urezki@pc638:~$ sort -h /proc/allocinfo
    127MiB    32565 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    197MiB    50506 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    278MiB    18519 mm/slub.c:2202 module:slub func:alloc_slab_page
   5.36GiB  1405072 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        6741       57652           0         394       57431
Swap:              0           0           0
urezki@pc638:~$

pagetable_alloc increases as soon as higher pressure is applied by raising
the number of workers. Running the same number of jobs in a subsequent run
does not increase it further; it stays at the same level as in the previous run.
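
As a quick sanity check on the 256-worker run: 1405072 calls at one 4K page
each (order-0) amounts to 1405072 * 4096 bytes ~= 5.36GiB, so the reported
size is fully accounted for by order-0 page-table pages. For reference, the
allocator in question is: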

/**
 * pagetable_alloc - Allocate pagetables
 * @gfp:    GFP flags
 * @order:  desired pagetable order
 *
 * pagetable_alloc allocates memory for page tables as well as a page table
 * descriptor to describe that memory.
 *
 * Return: The ptdesc describing the allocated page tables.
 */
static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
{
        struct page *page = alloc_pages(gfp | __GFP_COMP, order);

        return page_ptdesc(page);
}

Could you please comment on it? Do you have any thoughts? Is it expected?
Are page tables ever shrunk?

/proc/slabinfo does not show any unusually high "active" or "number" of
objects in use by any cache.

/proc/meminfo - "VmallocUsed" stays low after those 3 tests.

I have checked it with KASAN and KMEMLEAK, and I do not see any issues.

Thank you for the help!

--
Uladzislau Rezki