Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

From: Wen Gu
Date: Fri Jan 05 2024 - 03:11:16 EST



On 2024/01/03 02:46, Uladzislau Rezki wrote:

Store allocated objects in a separate nodes. A va->va_start
address is converted into a correct node where it should
be placed and resided. An addr_to_node() function is used
to do a proper address conversion to determine a node that
contains a VA.

Such approach balances VAs across nodes as a result an access
becomes scalable. Number of nodes in a system depends on number
of CPUs.

Please note:

1. As of now allocated VAs are bound to a node-0. It means the
patch does not give any difference comparing with a current
behavior;

2. The global vmap_area_lock, vmap_area_root are removed as there
is no need in it anymore. The vmap_area_list is still kept and
is _empty_. It is exported for a kexec only;

3. The vmallocinfo and vread() have to be reworked to be able to
handle multiple nodes.

Reviewed-by: Baoquan He <bhe@xxxxxxxxxx>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
---

<...>

struct vmap_area *find_vmap_area(unsigned long addr)
{
+ struct vmap_node *vn;
struct vmap_area *va;
+ int i, j;
- spin_lock(&vmap_area_lock);
- va = __find_vmap_area(addr, &vmap_area_root);
- spin_unlock(&vmap_area_lock);
+ /*
+ * An addr_to_node_id(addr) converts an address to a node index
+ * where a VA is located. If VA spans several zones and passed
+ * addr is not the same as va->va_start, what is not common, we
+ * may need to scan an extra nodes. See an example:
+ *
+ * <--va-->
+ * -|-----|-----|-----|-----|-
+ * 1 2 0 1
+ *
+ * VA resides in node 1 whereas it spans 1 and 2. If passed
+ * addr is within a second node we should do extra work. We
+ * should mention that it is rare and is a corner case from
+ * the other hand it has to be covered.
+ */
+ i = j = addr_to_node_id(addr);
+ do {
+ vn = &vmap_nodes[i];
- return va;
+ spin_lock(&vn->busy.lock);
+ va = __find_vmap_area(addr, &vn->busy.root);
+ spin_unlock(&vn->busy.lock);
+
+ if (va)
+ return va;
+ } while ((i = (i + 1) % nr_vmap_nodes) != j);
+
+ return NULL;
}

Hi Uladzislau Rezki,

I really like your work, it is great and helpful!

Currently, I am working on using shared memory communication (SMC [1])
to transparently accelerate TCP communication between two peers within
the same OS instance[2].

In this scenario, a vzalloced kernel buffer acts as a shared memory and
will be simultaneous read or written by two SMC sockets, thus forming an
SMC connection.


socket1 socket2
| ^
| | userspace
---- write -------------------- read ------
| +-----------------+ | kernel
+--->| shared memory |---+
| (vzalloced now) |
+-----------------+

Then I encountered the performance regression caused by lock contention
in find_vmap_area() when multiple threads transfer data through multiple
SMC connections on machines with many CPUs[3].

According to perf, the performance bottleneck is caused by the global
vmap_area_lock contention[4]:

- writer:

smc_tx_sendmsg
-> memcpy_from_msg
-> copy_from_iter
-> check_copy_size
-> check_object_size
-> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
-> if(vm) find_vmap_area
-> try to hold vmap_area_lock lock
- reader:

smc_rx_recvmsg
-> memcpy_to_msg
-> copy_to_iter
-> check_copy_size
-> check_object_size
-> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
-> if(vm) find_vmap_area
-> try to hold vmap_area_lock lock

Fortunately, thank you for this patch set, the global vmap_area_lock was
removed and per node lock vn->busy.lock is introduced. it is really helpful:

In 48 CPUs qemu environment, the Requests/s increased by 5 times:
- nginx
- wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80

vzalloced shmem vzalloced shmem(with this patch set)
Requests/sec 113536.56 583729.93


But it also has some overhead, compared to using kzalloced shared memory
or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area:

kzalloced shmem vzalloced shmem(unset CONFIG_HARDENED_USERCOPY)
Requests/sec 831950.39 805164.78


So, as a newbie in Linux-mm, I would like to ask for some suggestions:

Is it possible to further eliminate the overhead caused by lock contention
in find_vmap_area() in this scenario (maybe this is asking too much), or the
only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced
buffer in the situation where cocurrent kernel-userspace-copy happens?

Any feedback will be appreciated. Thanks again for your time.


[1] Shared Memory Communications (SMC) enables two SMC capable peers to
communicate by using memory buffers that each peer allocates for the
partner's use. It improves throughput, lowers latency and cost, and
maintains existing functions. See details in https://www.ibm.com/support/pages/node/7009315

[2] https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@xxxxxxxxxxxxxxxxx/

[3] issues: https://lore.kernel.org/all/1fbd6b74-1080-923a-01c1-689c3d65f880@xxxxxxxxxx/
analysis: https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@xxxxxxxxxxxxxxxxx/

[4] Some flamegraphs are attached,
- SMC using vzalloced buffer: vzalloc_t96.svg
- SMC using vzalloced buffer and with this patchset: vzalloc_t96_improve.svg
- SMC using vzalloced buffer and unset CONFIG_HARDENED_USERCOPY: vzalloc_t96_nocheck.svg
- SMC using kzalloced buffer: kzalloc_t96.svg


Best regards,
Wen Gu

Attachment: vzalloc_t96.svg
Description: image/svg

Attachment: vzalloc_t96_improve.svg
Description: image/svg

Attachment: vzalloc_t96_nocheck.svg
Description: image/svg

Attachment: kzalloc_t96.svg
Description: image/svg