Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits

From: Waiman Long
Date: Tue May 09 2023 - 10:35:11 EST

Next message: Jason-JH Lin (林睿祥): "Re: [PATCH v3 00/11] MediaTek DDP GAMMA - 12-bit LUT support"
Previous message: Simon Horman: "Re: [PATCH v2 4/5] net/ncsi: add shift MAC address property"
In reply to: Zhuo, Qiuxu: "RE: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits"
Next in thread: kernel test robot: "Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 5/8/23 23:30, Zhuo, Qiuxu wrote:

From: Waiman Long <longman@xxxxxxxxxx>
Sent: Tuesday, May 9, 2023 10:58 AM
To: Zhuo, Qiuxu <qiuxu.zhuo@xxxxxxxxx>; Peter Zijlstra
<peterz@xxxxxxxxxxxxx>; Ingo Molnar <mingo@xxxxxxxxxx>; Will Deacon
<will@xxxxxxxxxx>
Cc: Boqun Feng <boqun.feng@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on
locked_pending bits

On 5/8/23 22:45, Zhuo, Qiuxu wrote:

Hi Waiman,

Thanks for your review of this patch.
Please see the comments below.

From: Waiman Long <longman@xxxxxxxxxx>
Sent: Monday, May 8, 2023 11:31 PM
To: Zhuo, Qiuxu <qiuxu.zhuo@xxxxxxxxx>; Peter Zijlstra
<peterz@xxxxxxxxxxxxx>; Ingo Molnar <mingo@xxxxxxxxxx>; Will Deacon
<will@xxxxxxxxxx>
Cc: Boqun Feng <boqun.feng@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only
spin on locked_pending bits

On 5/8/23 04:15, Qiuxu Zhuo wrote:

The 1st spinner (header of the MCS queue) spins on the whole
qspinlock variable to check whether the lock is released. For a
contended qspinlock, this spinning is a hotspot as each CPU queued
in the MCS queue performs the spinning when it becomes the 1st
spinner (header of

the MCS queue).

The granularity among SMT h/w threads in the same core could be

"byte"

which the Load-Store Unit (LSU) inside the core handles. Making the
1st spinner only spin on locked_pending bits (not the whole
qspinlock) can avoid the false dependency between the tail field and
the locked_pending field. So this micro-optimization helps the h/w
thread (the 1st spinner) stay in a low power state and prevents it
from being woken up by other h/w threads in the same core when they
perform
xchg_tail() to update the tail field. Please see a similar
discussion in the link

[1].

[1]
https://lore.kernel.org/r/20230105021952.3090070-1-

guoren@xxxxxxxxxx

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@xxxxxxxxx>
---
kernel/locking/qspinlock.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index efebbf19f887..e7b990b28610 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -513,7 +513,20 @@ void __lockfunc

queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)

if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;

+#if _Q_PENDING_BITS == 8
+ /*
+ * Spinning on the 2-byte locked_pending instead of the 4-byte

qspinlock

+ * variable can avoid the false dependency between the tail field and
+ * the locked_pending field. This helps the h/w thread (the 1st

spinner)

+ * stay in a low power state and prevents it from being woken up
+by

other

+ * h/w threads in the same core when they perform xchg_tail() to

update

+ * the tail field only.
+ */
+ smp_cond_load_acquire(&lock->locked_pending, !VAL);
+ val = atomic_read_acquire(&lock->val); #else
val = atomic_cond_read_acquire(&lock->val, !(VAL &
_Q_LOCKED_PENDING_MASK));
+#endif

locked:
/*

What hardware can benefit from this change? Do you have any micro-
benchmark that can show the performance benefit?

i) I don't have the hardware to measure the data.
But I run a benchmark [1] for the contended spinlock on an Intel

Sapphire Rapids

server (192 h/w threads, 2sockets) that showed that the 1st spinner

spinning was

a hotspot (contributed ~55% cache bouncing traffic measured by the

perf C2C.

I don't analyze the cache bouncing here, but just say the spinning is a

hotspot).
I believe the amount of cacheline bouncing will be the same whether you
read 32 or 16 bits from the lock word. At least this is my understanding of the
x86 arch. Please correct me if my assumption is incorrect.

You're right.
The amount of cache line bouncing was nearly the same either spinning 32 or 16 bits
(according to my measured perf C2C data on an x86 server).

ii) The similar micro-optimization discussion [2] (looked like it was accepted

by you 😉) that

avoiding the false dependency (between the tail field and the

locked_pending field)

should help some arches (e.g., some ARM64???) the h/w thread

(spinner) stay in a

low-power state without the disruption by its sibling h/w threads in the

same core.

That is true. However, this patch causes one more read from the lock
cacheline which isn't necessary for arches that won't benefit from it.
So I am less incline to accept it unless there is evidence of the
benefit it can bring.

This patch removes a bitwise AND operation on the VAL value.
Does it compensate for the one more read from the cache line?

Register to register operation in the case of bitwise AND doesn't cost much, it is the potential reading from a hot cacheline that cause most of the delay. Besides there is also an additional acquire barrier. Again, if there is no concrete proof of a benefit, there is no point to make the code more complicated.

Cheers,
Longman

Next message: Jason-JH Lin (林睿祥): "Re: [PATCH v3 00/11] MediaTek DDP GAMMA - 12-bit LUT support"
Previous message: Simon Horman: "Re: [PATCH v2 4/5] net/ncsi: add shift MAC address property"
In reply to: Zhuo, Qiuxu: "RE: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits"
Next in thread: kernel test robot: "Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]