From: Ingo Molnar
Sent: 30 December 2023 11:09

* Waiman Long <longman@xxxxxxxxxx> wrote:
On 12/29/23 15:57, David Laight wrote:
this_cpu_ptr() is rather more expensive than raw_cpu_read() since
the latter can use an 'offset from register' (%gs for x86-64).
Add a 'self' field to 'struct optimistic_spin_node' that can be
read with raw_cpu_read(), initialise on first call.
Signed-off-by: David Laight <david.laight@xxxxxxxxxx>
---
kernel/locking/osq_lock.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 9bb3a077ba92..b60b0add0161 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,7 +13,7 @@
*/
struct optimistic_spin_node {
- struct optimistic_spin_node *next, *prev;
+ struct optimistic_spin_node *self, *next, *prev;
int locked; /* 1 if lock acquired */
int cpu; /* encoded CPU # + 1 value */
};
@@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
bool osq_lock(struct optimistic_spin_queue *lock)
{
- struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+ struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

My gcc 11 compiler produces the following x86-64 code:
92 struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
0x0000000000000029 <+25>: mov %rcx,%rdx
0x000000000000002c <+28>: add %gs:0x0(%rip),%rdx # 0x34 <osq_lock+36>
Which looks pretty optimized to me. Maybe an older compiler would generate more
complex code. However, I do have some doubt as to the benefit of this patch
at the expense of making the code a bit more complex.

My changed code is one instruction shorter!
18: 65 48 8b 15 00 00 00 mov %gs:0x0(%rip),%rdx # 20 <osq_lock+0x20>
1f: 00
1c: R_X86_64_PC32 .data..percpu..shared_aligned-0x4
However it might have one less cache line miss.
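For readers following along, the 'self' idea from the patch can be sketched in plain userspace C. This is only an illustration under assumptions, not the kernel code: `NR_CPUS`, `nodes[]`, `node_slow()` and `node_cached()` are hypothetical names, and the array index stands in for the %gs-relative access that raw_cpu_read() would use on x86-64.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4	/* hypothetical CPU count for this simulation */

struct spin_node {
	struct spin_node *self, *next, *prev;
	int locked;
	int cpu;
};

/* Stand-in for the per-cpu area: one zero-initialised node per CPU. */
static struct spin_node nodes[NR_CPUS];

/* Slow path: work out the node address from the CPU number. */
static struct spin_node *node_slow(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return &nodes[cpu];
}

/*
 * Fast path: one load of the cached pointer, which models
 * raw_cpu_read(osq_node.self). The pointer is filled in on first use.
 */
static struct spin_node *node_cached(int cpu)
{
	struct spin_node *node = nodes[cpu].self;

	if (!node) {
		node = node_slow(cpu);
		node->self = node;
	}
	return node;
}
```

After the first call for a given CPU, node_cached() returns the stored pointer without recomputing the address, which is the saving the patch is after.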
GCC-11 is plenty of a look-back window in terms of compiler efficiency:
latest enterprise distros use GCC-11 or newer, while recent desktop
distros use GCC-13. Anything older won't matter, because no major
distribution is going to use new kernels with old compilers.

There must be a difference in the header files as well.
Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
But maybe based on some config option.
I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
which necessitates an array lookup (indexed by cpu number).
Whereas I think you are seeing it implemented as
raw_cpu_read(per_cpu_data_base) + offset_to(xxx)
So the old code generates (after the prologue):
10: 49 89 fd mov %rdi,%r13
13: 49 c7 c4 00 00 00 00 mov $0x0,%r12
16: R_X86_64_32S .data..percpu..shared_aligned
1a: e8 00 00 00 00 callq 1f <osq_lock+0x1f>
1b: R_X86_64_PC32 debug_smp_processor_id-0x4
1f: 89 c0 mov %eax,%eax
21: 48 8b 1c c5 00 00 00 mov 0x0(,%rax,8),%rbx
28: 00
25: R_X86_64_32S __per_cpu_offset
29: e8 00 00 00 00 callq 2e <osq_lock+0x2e>
2a: R_X86_64_PC32 debug_smp_processor_id-0x4
2e: 4c 01 e3 add %r12,%rbx
31: 83 c0 01 add $0x1,%eax
34: c7 43 10 00 00 00 00 movl $0x0,0x10(%rbx)
3b: 48 c7 03 00 00 00 00 movq $0x0,(%rbx)
42: 89 43 14 mov %eax,0x14(%rbx)
45: 41 87 45 00 xchg %eax,0x0(%r13)
I was also surprised that smp_processor_id() is a real function rather
than an offset from %gs.
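
The two lowerings of this_cpu_ptr() discussed above can be compared in a small userspace model. This is a sketch under assumptions, not kernel code: `areas[]`, `cpu_offset[]`, `this_cpu_off`, `get_cpu_id()` and `setup()` are hypothetical stand-ins for the per-cpu areas, __per_cpu_offset[], the per-cpu base reachable through %gs, and smp_processor_id(). The calls to debug_smp_processor_id in the listing would be consistent with a CONFIG_DEBUG_PREEMPT build, but I have not confirmed that for this config.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4	/* hypothetical CPU count for this simulation */

struct pcpu_area {
	long pad;
	int counter;
};

static struct pcpu_area areas[NR_CPUS];	/* one copy of the data per CPU */
static uintptr_t cpu_offset[NR_CPUS];	/* models __per_cpu_offset[] */
static int current_cpu;			/* models the CPU we are running on */
static uintptr_t this_cpu_off;		/* models the base readable via %gs */

/* Fill in the offset table and pretend to run on 'cpu'. */
static void setup(int cpu)
{
	for (int i = 0; i < NR_CPUS; i++)
		cpu_offset[i] = (uintptr_t)&areas[i] - (uintptr_t)&areas[0];
	current_cpu = cpu;
	this_cpu_off = cpu_offset[cpu];
}

/* Models smp_processor_id() as a real function call. */
static int get_cpu_id(void)
{
	return current_cpu;
}

/* per_cpu_ptr(&x, smp_processor_id()): call, table lookup, then add. */
static int *counter_via_table(void)
{
	uintptr_t off = cpu_offset[get_cpu_id()];

	return (int *)((uintptr_t)&areas[0].counter + off);
}

/* raw_cpu_read(base) + offset_to(x): one load, then add. */
static int *counter_via_base(void)
{
	return (int *)((uintptr_t)&areas[0].counter + this_cpu_off);
}
```

Both functions return the same address; the difference is only in how many steps it takes to get there, which is what shows up in the two disassembly listings.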