Re: [PATCH] locking/osq_lock: Optimize osq_lock performance using per-NUMA

From: Waiman Long
Date: Tue Feb 20 2024 - 13:16:53 EST



On 2/20/24 02:30, Guo Hui wrote:
After extensive testing of osq_lock, we found that its performance is
closely tied to the distance between NUMA nodes: the greater the
distance, the more severe the performance degradation. When a group of
processes competing for the same lock all run on one NUMA node,
osq_lock performs best; when the group is spread across NUMA nodes,
performance worsens as the inter-node distance grows.

This patch uses the following approach to improve performance:
divide the osq_lock linked list by NUMA node, so that each NUMA node
has its own osq linked list and each CPU is added to the list of its
own node. When the last CPU on a NUMA node releases osq_lock, the
lock is passed to the next NUMA node.

As shown in the figure below, the last node on NUMA0 (osq_node1)
passes the lock to the first node (osq_node3) of the next node, NUMA1.

-----------------------------------------------------------
|            NUMA0           |            NUMA1           |
|----------------------------|----------------------------|
| osq_node0 ---> osq_node1 --|-> osq_node3 ---> osq_node4 |
-----------------------------------------------------------
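To make the description concrete, the enqueue side could look roughly
like the sketch below. This is only an illustration of the scheme as
described, not the patch itself: encode_cpu(), osq_node and
OSQ_UNLOCKED_VAL come from the stock kernel/locking/osq_lock.c, while
osq_lock_node, the per-node tail indexing and the function name are
assumptions drawn from this cover letter.

static atomic_t osq_lock_node = ATOMIC_INIT(0);	/* node owning the lock */

/*
 * Sketch only: each CPU joins the MCS queue of its own NUMA node,
 * and the head of that queue spins until the global osq_lock_node
 * says it is this node's turn.
 */
static bool osq_lock_numa_sketch(struct optimistic_spin_queue *lock)
{
	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
	int curr = encode_cpu(smp_processor_id());
	int nid = numa_node_id();
	int old;

	node->locked = 0;
	node->next = NULL;
	node->cpu = curr;

	/* Enqueue node-locally; remote nodes never touch this tail. */
	old = atomic_xchg(&lock->tail[nid], curr);
	if (old == OSQ_UNLOCKED_VAL) {
		/* Head of this node's queue: wait for the node's turn. */
		while (atomic_read(&osq_lock_node) != nid)
			cpu_relax();
		return true;
	}

	/*
	 * A successor would link behind 'old' and spin on its own
	 * node->locked exactly as the stock osq_lock() does; elided
	 * to keep the sketch short.
	 */
	return false;
}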

An atomic global variable, osq_lock_node, records the NUMA node
number that may currently acquire the osq_lock. While osq_lock_node
holds a given node number, the CPUs on that node acquire the lock in
turn, and the CPUs on the other NUMA nodes poll and wait.

This solution greatly reduces the performance degradation caused
by communication between CPUs on different NUMA nodes.
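On the release side, the hand-off between nodes could then look
something like the sketch below, under the same assumptions as the
enqueue sketch above; a real implementation would also have to pick
the next node that actually has waiters rather than blindly rotating
(nr_node_ids is the kernel's node count from <linux/nodemask.h>).

/*
 * Sketch only: if we were the last waiter queued on our node,
 * rotate ownership to the next NUMA node so that its queue head
 * stops polling osq_lock_node and proceeds.
 */
static void osq_unlock_numa_sketch(struct optimistic_spin_queue *lock)
{
	int curr = encode_cpu(smp_processor_id());
	int nid = numa_node_id();

	if (atomic_cmpxchg_release(&lock->tail[nid], curr,
				   OSQ_UNLOCKED_VAL) == curr) {
		/* This node's queue drained: pass the lock along. */
		atomic_set(&osq_lock_node, (nid + 1) % nr_node_ids);
		return;
	}

	/*
	 * Otherwise hand off to the next spinner on the same node by
	 * setting its node->locked, as the stock osq_unlock() does.
	 */
}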

The effect on the 96-core 4-NUMA ARM64 platform is as follows:

System Benchmarks Partial Index          with patch  without patch  improvement
File Copy 1024 bufsize 2000 maxblocks        2060.8          980.3     +110.22%
File Copy 256 bufsize 500 maxblocks          1346.5          601.9     +123.71%
File Copy 4096 bufsize 8000 maxblocks        4229.9         2216.1      +90.87%

The effect on the 128-core 8-NUMA X86_64 platform is as follows:

System Benchmarks Partial Index          with patch  without patch  improvement
File Copy 1024 bufsize 2000 maxblocks         841.1          553.7      +51.91%
File Copy 256 bufsize 500 maxblocks           517.4          339.8      +52.27%
File Copy 4096 bufsize 8000 maxblocks        2058.4         1392.8      +47.79%

That is similar in idea to the numa-aware qspinlock patch series.

Signed-off-by: Guo Hui <guohui@xxxxxxxxxxxxx>
---
include/linux/osq_lock.h | 20 +++++++++++--
kernel/locking/osq_lock.c | 60 +++++++++++++++++++++++++++++++++------
2 files changed, 69 insertions(+), 11 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index ea8fb31379e3..c016c1cf5e8b 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -2,6 +2,8 @@
 #ifndef __LINUX_OSQ_LOCK_H
 #define __LINUX_OSQ_LOCK_H
+#include <linux/nodemask.h>
+
 /*
  * An MCS like lock especially tailored for optimistic spinning for sleeping
  * lock implementations (mutex, rwsem, etc).
@@ -11,8 +13,9 @@ struct optimistic_spin_queue {
 	/*
 	 * Stores an encoded value of the CPU # of the tail node in the queue.
 	 * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
+	 * The actual number of NUMA nodes is generally not greater than 32.
 	 */
-	atomic_t tail;
+	atomic_t tail[32];

That is a no-go. You are increasing the size of a mutex/rwsem by 128 bytes. If you want to enable this numa-awareness, you have to do it in a way without increasing the size of optimistic_spin_queue. My suggestion is to queue optimistic_spin_node in a numa-aware way in osq_lock.c without touching optimistic_spin_queue.
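For reference, one way to follow that suggestion (a sketch of the
idea only, not Longman's code) is to hang the NUMA information off
the per-CPU optimistic_spin_node, which is private to osq_lock.c, and
make the release path prefer a same-node successor, CNA-style; the
numa_node field and pick_next_owner() below are assumed additions,
and struct optimistic_spin_queue keeps its single atomic_t tail.

/* kernel/locking/osq_lock.c -- sketch only */
struct optimistic_spin_node {
	struct optimistic_spin_node *next, *prev;
	int locked;	/* 1 if lock acquired */
	int cpu;	/* encoded CPU # + 1 value */
	int numa_node;	/* assumed new field, set at enqueue time */
};

/* Sketch: pick the next owner, preferring a same-node successor. */
static struct optimistic_spin_node *
pick_next_owner(struct optimistic_spin_node *node)
{
	struct optimistic_spin_node *next = READ_ONCE(node->next);

	if (next && next->numa_node == node->numa_node)
		return next;	/* keep the lock on this NUMA node */

	/*
	 * A CNA-style scan could search further down the queue for a
	 * same-node waiter and park remote waiters on a secondary
	 * list; either way the public struct optimistic_spin_queue
	 * never grows.
	 */
	return next;
}

The trade-off is extra logic in the unqueue/release slow path rather
than extra space in every mutex and rwsem.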

Cheers,
Longman