Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

From: Ingo Molnar
Date: Wed Oct 04 2023 - 12:40:53 EST



* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

>
> * Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> > The percpu code mostly uses inline assembly. Using segment qualifiers
> > allows to use C code instead, which enables the compiler to perform
> > various optimizations (e.g. propagation of memory arguments). Convert
> > percpu read and write accessors to C code, so the memory argument can
> > be propagated to the instruction that uses this argument.
> >
> > Some examples of propagations:
> >
> > a) into sign/zero extensions:
> >
> > 110b54: 65 0f b6 05 00 00 00 movzbl %gs:0x0(%rip),%eax
> > 11ab90: 65 0f b6 15 00 00 00 movzbl %gs:0x0(%rip),%edx
> > 14484a: 65 0f b7 35 00 00 00 movzwl %gs:0x0(%rip),%esi
> > 1a08a9: 65 0f b6 43 78 movzbl %gs:0x78(%rbx),%eax
> > 1a08f9: 65 0f b6 43 78 movzbl %gs:0x78(%rbx),%eax
> >
> > 4ab29a: 65 48 63 15 00 00 00 movslq %gs:0x0(%rip),%rdx
> > 4be128: 65 4c 63 25 00 00 00 movslq %gs:0x0(%rip),%r12
> > 547468: 65 48 63 1f movslq %gs:(%rdi),%rbx
> > 5474e7: 65 48 63 0a movslq %gs:(%rdx),%rcx
> > 54d05d: 65 48 63 0d 00 00 00 movslq %gs:0x0(%rip),%rcx
>
> Could you please also quote a 'before' assembly sequence, at least once
> per group of propagations?

Ie. for any changes to x86 code generation, please follow the changelog
format of:

7c097ca50d2b ("x86/percpu: Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op")

...
Move the load of %rsi outside inline asm, so the compiler can
reuse the value. The code in slub.o improves from:

55ac: 49 8b 3c 24 mov (%r12),%rdi
55b0: 48 8d 4a 40 lea 0x40(%rdx),%rcx
55b4: 49 8b 1c 07 mov (%r15,%rax,1),%rbx
55b8: 4c 89 f8 mov %r15,%rax
55bb: 48 8d 37 lea (%rdi),%rsi
55be: e8 00 00 00 00 callq 55c3 <...>
55bf: R_X86_64_PLT32 this_cpu_cmpxchg16b_emu-0x4
55c3: 75 a3 jne 5568 <...>
55c5: ...

0000000000000000 <.altinstr_replacement>:
5: 65 48 0f c7 0f cmpxchg16b %gs:(%rdi)

to:

55ac: 49 8b 34 24 mov (%r12),%rsi
55b0: 48 8d 4a 40 lea 0x40(%rdx),%rcx
55b4: 49 8b 1c 07 mov (%r15,%rax,1),%rbx
55b8: 4c 89 f8 mov %r15,%rax
55bb: e8 00 00 00 00 callq 55c0 <...>
55bc: R_X86_64_PLT32 this_cpu_cmpxchg16b_emu-0x4
55c0: 75 a6 jne 5568 <...>
55c2: ...

Where the alternative replacement instruction now uses %rsi:

0000000000000000 <.altinstr_replacement>:
5: 65 48 0f c7 0e cmpxchg16b %gs:(%rsi)

The instruction (effectively a reg-reg move) at 55bb: in the original
assembly is removed. Also, both the CALL and replacement CMPXCHG16B
are 5 bytes long, removing the need for NOPs in the asm code.
...

Thanks,

Ingo