Re: [PATCH] locking/rwsem: Optimize down_read_trylock() under highly contended case

From: Muchun Song
Date: Thu Nov 18 2021 - 21:53:40 EST


On Thu, Nov 18, 2021 at 8:57 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Thu, Nov 18, 2021 at 05:44:55PM +0800, Muchun Song wrote:
>
> > By using the above benchmark, the real execution time on an x86-64
> > system before and after the patch was:
>
> What kind of x86_64 ?

Intel(R) Xeon(R) CPU E5-2650

>
> >
> >                Before Patch   After Patch
> > # of Threads   real           real          reduced by
> > ------------   ------         ------        ----------
> >            1   65,373         65,206        ~0.0%
> >            4   15,467         15,378        ~0.5%
> >           40    6,214          5,528        ~11.0%
> >
> > For the uncontended case, the new down_read_trylock() is the same as
> > before. For the contended cases, the new down_read_trylock() is faster
> > than before. The more contended the rwsem is, the larger the gain.
> >
> > Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> > ---
> > kernel/locking/rwsem.c | 11 ++++-------
> > 1 file changed, 4 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index c51387a43265..ef2b2a3f508c 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1249,17 +1249,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
> >
> >  	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
> >
> > -	/*
> > -	 * Optimize for the case when the rwsem is not locked at all.
> > -	 */
> > -	tmp = RWSEM_UNLOCKED_VALUE;
> > -	do {
> > +	tmp = atomic_long_read(&sem->count);
> > +	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
> >  		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> > -						    tmp + RWSEM_READER_BIAS)) {
> > +					tmp + RWSEM_READER_BIAS)) {
> >  			rwsem_set_reader_owned(sem);
> >  			return 1;
> >  		}
> > -	} while (!(tmp & RWSEM_READ_FAILED_MASK));
> > +	}
> >  	return 0;
> >  }
>
> This is weird... so the only difference is that leading load, but given
> contention you'd expect that load to stall, also, given it's a
> non-exclusive load, to get stolen by a competing CPU. Whereas the old
> code would start with a cmpxchg, which obviously will also stall, but
> does an exclusive load.

The old code always assumes the rwsem is unlocked, so the local
variable @tmp is initialized to 0 (RWSEM_UNLOCKED_VALUE). Ideally,
the first cmpxchg succeeds. However, when the rwsem is contended
(e.g. the mmap read lock is mostly held by other threads),
sem->count is rarely RWSEM_UNLOCKED_VALUE, so the first cmpxchg is
bound to fail; all it accomplishes is fetching the current count
into @tmp, and it is the second cmpxchg that usually succeeds. This
patch loads sem->count into @tmp first, so the first cmpxchg usually
succeeds. The patch therefore reduces the number of cmpxchg
executions when running the above benchmark.
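
To make the difference concrete, here is a minimal userspace sketch
of the two loop shapes (C11 atomics, not the kernel implementation;
the bias and mask values below are illustrative stand-ins for the
kernel's RWSEM_* constants):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define READER_BIAS		1L
#define READ_FAILED_MASK	(1L << 62)	/* stand-in "cannot grab it" bit */

static atomic_long count;

/*
 * Old shape: optimistically assume the unlocked value (0). Under reader
 * contention the first cmpxchg almost always fails, serving only to
 * fetch the real count into tmp; the second attempt does the work.
 */
static bool trylock_old(void)
{
	long tmp = 0;	/* RWSEM_UNLOCKED_VALUE */

	do {
		/* on failure, tmp is updated to the value just observed */
		if (atomic_compare_exchange_strong(&count, &tmp,
						   tmp + READER_BIAS))
			return true;
	} while (!(tmp & READ_FAILED_MASK));
	return false;
}

/*
 * New shape: read the current count first, so under contention the
 * first cmpxchg is usually the one that succeeds.
 */
static bool trylock_new(void)
{
	long tmp = atomic_load(&count);

	while (!(tmp & READ_FAILED_MASK)) {
		if (atomic_compare_exchange_strong(&count, &tmp,
						   tmp + READER_BIAS))
			return true;
	}
	return false;
}

int main(void)
{
	atomic_store(&count, 3 * READER_BIAS);	/* pretend 3 readers hold it */
	printf("old: %d, new: %d\n", trylock_old(), trylock_new());
	return 0;
}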

I added the debugging patch below to verify this and collected the
following statistics with bpftrace:

  bpftrace -e 'tracepoint:rwsem:rwsem_read_loop /comm == "a.out"/ {
          @cnt[args->count] = count(); }'

Before patch:
@cnt[17]: 1
@cnt[11]: 6
@cnt[10]: 14
@cnt[9]: 62
@cnt[8]: 339
@cnt[1]: 1211
@cnt[7]: 1585
@cnt[6]: 7796
@cnt[5]: 34742
@cnt[4]: 158245
@cnt[3]: 921631
@cnt[2]: 25089070

After patch:
@cnt[9]: 1
@cnt[8]: 3
@cnt[7]: 28
@cnt[6]: 151
@cnt[0]: 197
@cnt[5]: 823
@cnt[4]: 5521
@cnt[3]: 39126
@cnt[2]: 328012
@cnt[1]: 25840840

As you can see, the total number of cmpxchg executions is roughly
halved: weighting each iteration count by its frequency gives about
53.8 million cmpxchg calls before the patch vs. about 26.6 million
after, i.e. ~2.05 vs. ~1.02 cmpxchg per successful acquisition.
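
For the record, those totals can be reproduced from the tables above
with a trivial standalone program (plain userspace C; the arrays are
just the bpftrace output copied verbatim):

#include <stdio.h>

int main(void)
{
	/* {loop count, frequency} pairs from the bpftrace output */
	static const long before[][2] = {
		{1, 1211}, {2, 25089070}, {3, 921631}, {4, 158245},
		{5, 34742}, {6, 7796}, {7, 1585}, {8, 339},
		{9, 62}, {10, 14}, {11, 6}, {17, 1},
	};
	static const long after[][2] = {
		{0, 197}, {1, 25840840}, {2, 328012}, {3, 39126},
		{4, 5521}, {5, 823}, {6, 151}, {7, 28}, {8, 3}, {9, 1},
	};
	long total, acquires;
	unsigned long i;

	total = acquires = 0;
	for (i = 0; i < sizeof(before) / sizeof(before[0]); i++) {
		total += before[i][0] * before[i][1];
		acquires += before[i][1];
	}
	printf("before: %ld cmpxchg, %.2f per acquire\n",
	       total, (double)total / acquires);

	total = acquires = 0;
	for (i = 0; i < sizeof(after) / sizeof(after[0]); i++) {
		total += after[i][0] * after[i][1];
		acquires += after[i][1];
	}
	printf("after:  %ld cmpxchg, %.2f per acquire\n",
	       total, (double)total / acquires);
	return 0;
}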

Thanks.

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ef2b2a3f508c..cdeb1ad1921f 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1246,14 +1246,18 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
 	long tmp;
+	int count = 0;
 
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
 
 	tmp = atomic_long_read(&sem->count);
 	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
+		count++;
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
 						    tmp + RWSEM_READER_BIAS)) {
 			rwsem_set_reader_owned(sem);
+			if (&current->mm->mmap_lock == sem)
+				trace_rwsem_read_loop(count);
 			return 1;
 		}
 	}
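
The trace_rwsem_read_loop() call above assumes a tracepoint along the
lines of the following TRACE_EVENT definition (not part of the posted
patch; just a sketch of the debug-only tracepoint backing the bpftrace
probe):

TRACE_EVENT(rwsem_read_loop,

	TP_PROTO(int count),

	TP_ARGS(count),

	TP_STRUCT__entry(
		__field(int, count)
	),

	TP_fast_assign(
		__entry->count = count;
	),

	TP_printk("cmpxchg iterations=%d", __entry->count)
);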

>
> And the thinking is that the exclusive load and the presence of the
> cmpxchg loop would keep the line on that CPU for a little while and
> progress is made.
>
> Clearly this isn't working as expected. Also I suppose it would need
> wider testing...
>