RE: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits

From: Zhuo, Qiuxu
Date: Mon May 08 2023 - 22:45:47 EST


Hi Waiman,

Thanks for your review of this patch.
Please see the comments below.

> From: Waiman Long <longman@xxxxxxxxxx>
> Sent: Monday, May 8, 2023 11:31 PM
> To: Zhuo, Qiuxu <qiuxu.zhuo@xxxxxxxxx>; Peter Zijlstra
> <peterz@xxxxxxxxxxxxx>; Ingo Molnar <mingo@xxxxxxxxxx>; Will Deacon
> <will@xxxxxxxxxx>
> Cc: Boqun Feng <boqun.feng@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on
> locked_pending bits
>
>
> On 5/8/23 04:15, Qiuxu Zhuo wrote:
> > The 1st spinner (the head of the MCS queue) spins on the whole qspinlock
> > variable to check whether the lock has been released. For a contended
> > qspinlock, this spinning is a hotspot, as each CPU queued in the MCS
> > queue performs this spinning when it becomes the 1st spinner (the head
> > of the MCS queue).
> >
> > The access granularity among SMT h/w threads in the same core could be
> > as small as a byte, which the Load-Store Unit (LSU) inside the core
> > handles. Making the 1st spinner spin only on the locked_pending bits
> > (not the whole qspinlock) avoids a false dependency between the tail
> > field and the locked_pending field. This micro-optimization helps the
> > h/w thread (the 1st spinner) stay in a low-power state and prevents it
> > from being woken up by other h/w threads in the same core when they
> > perform xchg_tail() to update only the tail field. See a similar
> > discussion in link [1].
> >
> > [1]
> > https://lore.kernel.org/r/20230105021952.3090070-1-guoren@xxxxxxxxxx
> >
> > Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@xxxxxxxxx>
> > ---
> > kernel/locking/qspinlock.c | 13 +++++++++++++
> > 1 file changed, 13 insertions(+)
> >
> > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > index efebbf19f887..e7b990b28610 100644
> > --- a/kernel/locking/qspinlock.c
> > +++ b/kernel/locking/qspinlock.c
> > @@ -513,7 +513,20 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  	if ((val = pv_wait_head_or_lock(lock, node)))
> >  		goto locked;
> >
> > +#if _Q_PENDING_BITS == 8
> > +	/*
> > +	 * Spinning on the 2-byte locked_pending instead of the 4-byte qspinlock
> > +	 * variable can avoid the false dependency between the tail field and
> > +	 * the locked_pending field. This helps the h/w thread (the 1st spinner)
> > +	 * stay in a low power state and prevents it from being woken up by other
> > +	 * h/w threads in the same core when they perform xchg_tail() to update
> > +	 * the tail field only.
> > +	 */
> > +	smp_cond_load_acquire(&lock->locked_pending, !VAL);
> > +	val = atomic_read_acquire(&lock->val);
> > +#else
> >  	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
> > +#endif
> >
> > locked:
> > /*
>
> What hardware can benefit from this change? Do you have any
> micro-benchmark that can show the performance benefit?

i) I don't have the hardware to measure a performance benefit for this change.
However, I ran a benchmark [1] for contended spinlocks on an Intel Sapphire Rapids
server (2 sockets, 192 h/w threads), and it showed that the 1st spinner's spinning
was a hotspot (it contributed ~55% of the cache-bouncing traffic measured by perf c2c).
I'm not analyzing the cache bouncing itself here, only pointing out that the spinning
is a hotspot.

ii) A similar micro-optimization was discussed in [2] (it looked like you were OK with it 😉):
avoiding the false dependency between the tail field and the locked_pending field should,
on some arches (e.g., some ARM64 implementations?), help the h/w thread (the spinner) stay
in a low-power state without being disturbed by its sibling h/w threads in the same core.
See the xchg_tail() sketch after the links below.

[1] https://github.com/yamahata/ipi_benchmark.git
[2] https://lore.kernel.org/r/20230105021952.3090070-1-guoren@xxxxxxxxxx
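
For reference, when _Q_PENDING_BITS == 8, the queued waiters publish themselves with
a 16-bit exchange that only touches the tail halfword of the lock word; roughly, this
is what xchg_tail() in kernel/locking/qspinlock.c looks like (original comment
replaced with my own):

    static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
    {
            /*
             * A relaxed 16-bit exchange on the tail halfword only: it never
             * writes the locked_pending halfword that the 1st spinner monitors.
             */
            return (u32)xchg_relaxed(&lock->tail,
                                     tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
    }

So, at the byte granularity described above, these tail-only stores should not disturb
a spinner that monitors only the locked_pending halfword, which is the behavior the
patch relies on.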

Thanks
-Qiuxu

> Cheers,
> Longman