Re: [PATCH] x86/smp: Use atomic_try_cmpxchg() to micro-optimize native_stop_other_cpus()

From: Uros Bizjak
Date: Fri Nov 17 2023 - 11:38:01 EST


On Fri, Nov 17, 2023 at 5:18 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 11/14/23 08:43, Uros Bizjak wrote:
> > Use atomic_try_cmpxchg() instead of atomic_cmpxchg(*ptr, old, new) == old
> > in native_stop_other_cpus(). On x86 the CMPXCHG instruction returns success
> > in the ZF flag, so this change saves a compare after CMPXCHG. Together
> > with a small code reorder, the generated asm code improves from:
> >
> > 74: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
> > 7a: 41 54 push %r12
> > 7c: 55 push %rbp
> > 7d: 65 8b 2d 00 00 00 00 mov %gs:0x0(%rip),%ebp
> > 84: 53 push %rbx
> > 85: 85 c0 test %eax,%eax
> > 87: 75 71 jne fa <native_stop_other_cpus+0x8a>
> > 89: b8 ff ff ff ff mov $0xffffffff,%eax
> > 8e: f0 0f b1 2d 00 00 00 lock cmpxchg %ebp,0x0(%rip)
> > 95: 00
> > 96: 83 f8 ff cmp $0xffffffff,%eax
> > 99: 75 5f jne fa <native_stop_other_cpus+0x8a>
> >
> > to:
> >
> > 74: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
> > 7a: 85 c0 test %eax,%eax
> > 7c: 0f 85 84 00 00 00 jne 106 <native_stop_other_cpus+0x96>
> > 82: 41 54 push %r12
> > 84: b8 ff ff ff ff mov $0xffffffff,%eax
> > 89: 55 push %rbp
> > 8a: 53 push %rbx
> > 8b: 65 8b 1d 00 00 00 00 mov %gs:0x0(%rip),%ebx
> > 92: f0 0f b1 1d 00 00 00 lock cmpxchg %ebx,0x0(%rip)
> > 99: 00
> > 9a: 75 5e jne fa <native_stop_other_cpus+0x8a>
> >
> > Please note early exit and lack of CMP after CMPXCHG.
>
> Uros, I really do appreciate that you are trying to optimize these
> paths. But the thing we have to balance is the _need_ for optimization
> with the chance that this will break something.
>
> This is about as much of a slow path as we have in the kernel. It's
> only used at shutdown, right? That means this is one of the places in
> the kernel that least needs optimization. It can only possibly get run
> once per boot.
>
> So, the benefit is that it might make this code a few cycles faster. In
> practice, it might not even be measurably faster.
>
> On the other hand, this is relatively untested and also makes the C code
> more complicated.
>
> Is there some other side benefit that I'm missing here? Applying this
> patch doesn't seem to have a great risk/reward ratio.

Yes, in addition to better asm code, I think that the use of magic
constant (-1) is not descriptive at all. I tried to make this code
look like nmi_panic() from kernel/panic.c, which has similar
functionality, and describe that this constant belongs to old_cpu
(same as in nmi_panic() ). Also, from converting many cmpxchg to
try_cmpxchg, it becomes evident that in cases like this (usage in "if"
clauses) the correct locking primitive is try_cmpxchg. Additionally,
in this particular case, it is not the speed, but a little code save
that can be achieved with the same functionality.

Thanks,
Uros.