Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

From: Linus Torvalds
Date: Sat Dec 30 2023 - 15:41:45 EST


On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.

No, this is horrible.

The problem isn't the "this_cpu_ptr()", it's the rest of the code.

> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

No. Both of these are crap.

> struct optimistic_spin_node *prev, *next;
> int old;
>
> - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> - node->cpu = encode_cpu(smp_processor_id());
> + if (unlikely(!node)) {
> + int cpu = encode_cpu(smp_processor_id());
> + node = decode_cpu(cpu);
> + node->self = node;
> + node->cpu = cpu;
> + }

The proper fix here is to not do that silly

node = this_cpu_ptr(&osq_node);
..
node->next = NULL;

dance at all, but to simply do

this_cpu_write(osq_node.next, NULL);

in the first place. That makes the whole thing just a single store off
the segment descriptor.
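
Roughly, the difference in generated code on x86-64 is something like the
following (approximate; the exact instructions depend on the compiler and
the percpu implementation):

	/*
	 * this_cpu_ptr() + store: form the per-cpu pointer first
	 * (a %gs-relative read of this_cpu_off folded into an add),
	 * then store through the resulting pointer.
	 */
	node = this_cpu_ptr(&osq_node);
	node->next = NULL;

	/*
	 * this_cpu_write(): a single %gs-relative store, roughly
	 * "movq $0, %gs:osq_node", with no pointer kept around at all.
	 */
	this_cpu_write(osq_node.next, NULL);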

Yes, you'll eventually end up doing that

node = this_cpu_ptr(&osq_node);

thing because it then wants to use that raw pointer to do

WRITE_ONCE(prev->next, node);

but that's a separate issue and still does not make it worth it to
create a pointless self-pointer.

Btw, if you *really* want to solve that separate issue, then make the
optimistic_spin_node struct not contain the pointers at all, but the
CPU numbers, and then turn those numbers into the pointers the exact
same way it does for the "lock->tail" thing, ie doing that whole

prev = decode_cpu(old);

dance. That *may* then result in avoiding turning them into pointers
at all in some cases.
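
A rough, untested sketch of that direction (field layout and names here are
just illustrative):

	struct optimistic_spin_node {
		int next, prev;		/* encoded CPU numbers, not pointers */
		int locked;
		int cpu;
	};

	/* writers publish their own CPU number into the previous node ... */
	WRITE_ONCE(prev->next, curr);

	/*
	 * ... and readers decode into a pointer only where one is really
	 * needed (after checking against OSQ_UNLOCKED_VAL for "no node").
	 */
	next = decode_cpu(READ_ONCE(node->next));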

Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
be -1 instead, and add something like

#define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

and that would then avoid the +1 / -1 games in encoding/decoding the
CPU numbers. Right now those games cause silly code to be generated, like this:

subl $1, %eax #, cpu_nr
...
cltq
addq __per_cpu_offset(,%rax,8), %rcx

which seems honestly stupid. The cltq is there for sign-extension,
which is because all these things are "int", and the "subl" will
zero-extend to 64-bit, not sign-extend.

At that point, I think gcc might be able to just generate

addq __per_cpu_offset-8(,%rax,8), %rcx

but honestly, I think it would be nicer to just have decode_cpu() do

unsigned int cpu_nr = encoded_cpu_val;

return per_cpu_ptr(&osq_node, cpu_nr);

and not have the -1/+1 at all.

Hmm?
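
IOW, putting those pieces together, something like this untested sketch
(the remaining OSQ_UNLOCKED_VAL users in osq_lock.h would need the matching
update):

	#define OSQ_UNLOCKED_VAL	(-1)
	#define IS_OSQ_UNLOCKED(x)	((int)(x) < 0)

	/* CPU numbers are used as-is, no more +1/-1 games */
	static inline int encode_cpu(int cpu_nr)
	{
		return cpu_nr;
	}

	static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
	{
		unsigned int cpu_nr = encoded_cpu_val;

		return per_cpu_ptr(&osq_node, cpu_nr);
	}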

UNTESTED patch to just do the "this_cpu_write()" parts attached.
Again, note how we do end up doing that this_cpu_ptr conversion later
anyway, but at least it's off the critical path.

Linus
kernel/locking/osq_lock.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 75a6f6133866..c3a166b7900c 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -92,14 +92,14 @@ osq_wait_next(struct optimistic_spin_queue *lock,

bool osq_lock(struct optimistic_spin_queue *lock)
{
- struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+ struct optimistic_spin_node *node;
struct optimistic_spin_node *prev, *next;
int curr = encode_cpu(smp_processor_id());
int old;

- node->locked = 0;
- node->next = NULL;
- node->cpu = curr;
+ this_cpu_write(osq_node.next, NULL);
+ this_cpu_write(osq_node.locked, 0);
+ this_cpu_write(osq_node.cpu, curr);

/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -112,7 +112,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
return true;

prev = decode_cpu(old);
- node->prev = prev;
+ this_cpu_write(osq_node.prev, prev);
+
+ node = this_cpu_ptr(&osq_node);

/*
* osq_lock() unqueue