[PATCH next v2 0/5] locking/osq_lock: Optimisations to osq_lock code.

From: David Laight
Date: Sun Dec 31 2023 - 16:50:26 EST


This is an updated series of optimisations to osq_lock.c
Patches #1 and #3 from v1 have been applied by Linus.
Some of the generated code issues I was getting were caused by
CONFIG_DEBUG_PREEMPT being set. No idea why, it isn't any more.

Patch #1 is the node->locked part of the old #2.

Patch #2 removes the pretty much guaranteed cache line reload getting
the cpu number (from node->prev) for the vcpu_is_preempted() check.
It is (basically) the old #5 with the addition of a READ_ONCE()
and leaving the '+ 1' offset (for patch 3).

Patch #3 ends up removing both node->cpu and node->prev.
This saves issues initialising node->cpu.
Basically node->cpu was only ever read as node->prev->cpu in the unqueue code.
Most of the time it is the value read from lock->tail that was used to
obtain 'prev' in the first place.
The only time it is different is in the unlock race path where 'prev'
is re-read from node->prev - updated right at the bottom of osq_lock().
So the updated node->prev_cpu can used (and prev obtained from it) without
worrying about only one of node->prev and node->prev-cpu being updated.

Linus did suggest just saving the cpu numbers instead of pointers.
It actually works for 'prev' but not 'next'.

Patch #4 removes the 'should be unnecessary' node->next = NULL
assignment from the top of osq_lock().
Since longman was worried about race conditions, I've added a
WARN_ON_ONCE() check that ensures it is NULL.
This saves dirtying the 'node' cache line in the fast path, but the
check still requires the cache line be loaded.

Patch #5 just stops gcc using two separate instructions to decrement
the offset cpu number and then convert it to 64 bits.
Linus got annoyed with it, and I'd spotted it as well.
I don't seem to be able to get gcc to convert __per_cpu_offset[cpu - 1]
to (__per_cpu_offset - 1)[cpu] (cpu is offset by one) but, in any case,
it would still need zero extending in the common case.

David Laight (5):
1) Defer clearing node->locked until the slow osq_lock() path.
2) Optimise vcpu_is_preempted() check.
3) Use node->prev_cpu instead of saving node->prev.
4) Avoid writing to node->next in the osq_lock() fast path.
5) Optimise decode_cpu() and per_cpu_ptr().

kernel/locking/osq_lock.c | 59 +++++++++++++++++++++------------------
1 file changed, 32 insertions(+), 27 deletions(-)

--
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)