Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

From: Peter Zijlstra
Date: Tue Jul 04 2023 - 05:47:04 EST


On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:

> int x = 0;
> int y = 0;
> int r0, r1;
>
> int dummy;
>
> void t0(void)
> {
>         __atomic_store_n(&x, 1, __ATOMIC_RELAXED);
>
>         __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
>         __atomic_thread_fence(__ATOMIC_SEQ_CST);
>
>         r0 = __atomic_load_n(&y, __ATOMIC_RELAXED);
> }
>
> void t1(void)
> {
>         __atomic_store_n(&y, 1, __ATOMIC_RELAXED);
>         __atomic_thread_fence(__ATOMIC_SEQ_CST);
>         r1 = __atomic_load_n(&x, __ATOMIC_RELAXED);
> }
>
> // BUG_ON(r0 == 0 && r1 == 0)
>
> On x86-64 (gcc 13.1 -O2) we get:
>
> t0():
>         movl    $1, x(%rip)
>         movl    $1, %eax
>         xchgl   dummy(%rip), %eax
>         lock orq $0, (%rsp)     ;; Redundant with previous exchange.
>         movl    y(%rip), %eax
>         movl    %eax, r0(%rip)
>         ret
> t1():
>         movl    $1, y(%rip)
>         lock orq $0, (%rsp)
>         movl    x(%rip), %eax
>         movl    %eax, r1(%rip)
>         ret

So I would expect the compilers to do better here. They should know those
__atomic_thread_fence() thingies are superfluous and simply not emit
them. This could even be done as a late peephole pass that sees
consecutive atomic ops and notices the second one is a no-op.
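
I.e. the pattern to catch is something like this (a sketch only; pub() is
a made-up name, and the "hoped for" asm is what I'd want a compiler to
emit, not what any of them does today):

  int x, dummy;

  void pub(void)
  {
          __atomic_store_n(&x, 1, __ATOMIC_RELAXED);

          /* XCHG carries an implicit LOCK prefix on x86-64 and is
           * already a full memory barrier ... */
          __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);

          /* ... so this fence adds nothing here; a peephole that sees
           * it right after the LOCK'ed RMW can simply not emit it. */
          __atomic_thread_fence(__ATOMIC_SEQ_CST);
  }

  /* Hoped-for x86-64 output; the GCC output above minus the LOCK ORQ
   * (and minus the MFENCE for clang):
   *
   *         movl    $1, x(%rip)
   *         movl    $1, %eax
   *         xchgl   dummy(%rip), %eax
   *         ret
   */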

> On x86-64 (clang 16 -O2) we get:
>
> t0():
>         movl    $1, x(%rip)
>         movl    $1, %eax
>         xchgl   %eax, dummy(%rip)
>         mfence                  ;; Redundant with previous exchange.

And that's just terrible :/ Nobody should be using MFENCE for this; the
only thing MFENCE buys you over a LOCK-prefixed instruction is ordering
for MMIO and other such 'lovely' things, and I don't think C++ atomics
cover those anyway. And emitting MFENCE right after a LOCK-prefixed
instruction (implicit in the XCHG in this case) is just fail.
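
If a compiler really wants a standalone full barrier there, a LOCK'ed RMW
on the stack -- like the LOCK ORQ GCC emits above, or the LOCK ADDL the
kernel uses for smp_mb() -- is the saner choice. Roughly (a sketch only;
full_fence() is a made-up helper name):

  /* Full barrier via a LOCK'ed no-op RMW on the stack, as a stand-in
   * for MFENCE when implementing __atomic_thread_fence(__ATOMIC_SEQ_CST)
   * on x86-64.  The extra strength of MFENCE only matters for MMIO,
   * non-temporal accesses and friends, which C11/C++11 atomics don't
   * cover anyway. */
  static inline void full_fence(void)
  {
          __asm__ __volatile__("lock; orq $0,(%%rsp)" ::: "memory", "cc");
  }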

>         movl    y(%rip), %eax
>         movl    %eax, r0(%rip)
>         retq
> t1():
>         movl    $1, y(%rip)
>         mfence
>         movl    x(%rip), %eax
>         movl    %eax, r1(%rip)
>         retq