Re: [PATCH] x86: Add an explicit barrier() to clflushopt()

From: Ross Zwisler
Date: Mon Oct 19 2015 - 14:29:19 EST


On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
> During testing we observed that the last cacheline was not being flushed
> from a
>
> mb()
> for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
> clflushopt();
> mb()
>
> loop (where the initial addr and end were not cacheline aligned).
>
> Changing the loop from addr < end to addr <= end, or replacing the
> clflushopt() with clflush() both fixed the testcase. Hinting that GCC
> was miscompling the assembly within the loop and specifically the
> alternative within clflushopt() was confusing the loop optimizer.
>
> Adding a barrier() into clflushopt() is enough for GCC to dtrt, but
> solving why GCC is not seeing the constraints from the alternative_io()
> would be smarter...
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92501
> Testcase: gem_tiled_partial_pwrite_pread/read
> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> Cc: H. Peter Anvin <hpa@xxxxxxxxxxxxxxx>
> Cc: Imre Deak <imre.deak@xxxxxxxxx>
> Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
> Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx
> ---
> arch/x86/include/asm/special_insns.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
> index 2270e41b32fd..0c7aedbf8930 100644
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -199,6 +199,11 @@ static inline void clflushopt(volatile void *__p)
> ".byte 0x66; clflush %P0",
> X86_FEATURE_CLFLUSHOPT,
> "+m" (*(volatile char __force *)__p));
> + /* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
> + * meeting this alternative() and demonstrably miscompiles loops
> + * iterating over clflushopts.
> + */
> + barrier();
> }
>
> static inline void clwb(volatile void *__p)
> --
> 2.6.1

I'm having trouble recreating this - can you help me get a simple reproducer?

I've been using the patch at the end of this mail as a test case.

If the issue you are seeing is indeed a compiler error, the issue should be
visible in the resulting assembly. I've dumped the assembly using objdump for
new loop in pmem_init() using both clflush() and clflushopt() in the loop, and
the loops are exactly the same except the clflushopt case has an extra byte as
part of the clflush instruction which is our NOOP prefix - we need this so
that the instruction is the right length so we can swap clflushopt in.

clflush() in loop:
3a: 0f ae 3a clflush (%rdx)

clflushopt() in loop:
3a: 3e 0f ae 3a clflush %ds:(%rdx)

We also get an extra entry in the <.altinstr_replacement> section in the
clflushopt case, as expected:

15: 66 data16
16: 0f ae 3a clflush (%rdx)

This all looks correct to me. I've tested with both GCC 4.9.2 and GCC 5.1.1,
and they behave the same.

I also tried inserting a barrier() in the clflushopt() inline function, and it
didn't change the resulting assembly for me.

What am I missing?

- Ross

-- 8< --