Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization

From: Arvind Sankar
Date: Sun Nov 08 2020 - 13:09:58 EST


On Sun, Nov 08, 2020 at 12:40:14PM -0500, Arvind Sankar wrote:
> On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> > Due to a Clang bug [1] neon autoloop vectorization does not happen or
> > happens badly with no gains and considering previous GCC experiences
> > which generated unoptimized code which was worse than the default asm
> > implementation, it is safer to default clang builds to the known good
> > generic implementation.
> >
> > The kernel currently supports a minimum Clang version of v10.0.1, see
> > commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> >
> > When the bug gets eventually fixed, this commit could be reverted or,
> > if the minimum clang version bump takes a long time, a warning could
> > be added for users to upgrade their compilers like was done for GCC.
> >
> > [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> >
> > Signed-off-by: Adrian Ratiu <adrian.ratiu@xxxxxxxxxxxxx>
> > ---
> > arch/arm/include/asm/xor.h | 3 ++-
> > arch/arm/lib/Makefile | 3 +++
> > arch/arm/lib/xor-neon.c | 4 ++++
> > 3 files changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> > index aefddec79286..49937dafaa71 100644
> > --- a/arch/arm/include/asm/xor.h
> > +++ b/arch/arm/include/asm/xor.h
> > @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
> > NEON_TEMPLATES; \
> > } while (0)
> >
> > -#ifdef CONFIG_KERNEL_MODE_NEON
> > +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 */
> > +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
> >
> > extern struct xor_block_template const xor_block_neon_inner;
> >
> > diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> > index 6d2ba454f25b..53f9e7dd9714 100644
> > --- a/arch/arm/lib/Makefile
> > +++ b/arch/arm/lib/Makefile
> > @@ -43,8 +43,11 @@ endif
> > $(obj)/csumpartialcopy.o: $(obj)/csumpartialcopygeneric.S
> > $(obj)/csumpartialcopyuser.o: $(obj)/csumpartialcopygeneric.S
> >
> > +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> > +ifndef CONFIG_CC_IS_CLANG
> > ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
> > NEON_FLAGS := -march=armv7-a -mfloat-abi=softfp -mfpu=neon
> > CFLAGS_xor-neon.o += $(NEON_FLAGS)
> > obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o
> > endif
> > +endif
> > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > index e1e76186ec23..84c91c48dfa2 100644
> > --- a/arch/arm/lib/xor-neon.c
> > +++ b/arch/arm/lib/xor-neon.c
> > @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
> > * Pull in the reference implementations while instructing GCC (through
> > * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> > * NEON instructions.
> > +
> > + * On Clang the loop vectorizer is enabled by default, but due to a bug
> > + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> > + * so xor-neon is disabled in favor of the default reg implementations.
> > */
> > #ifdef CONFIG_CC_IS_GCC
> > #pragma GCC optimize "tree-vectorize"
> > --
> > 2.29.0
> >
>
> It's actually a bad idea to use #pragma GCC optimize. This is basically
> the same as tagging all the functions with __attribute__((optimize)),
> which GCC does not recommend for production use, as it _replaces_
> optimization options rather than appending to them, and has been
> observed to result in dropping important compiler flags.
>
> There've been a few discussions recently around other such cases:
> https://lore.kernel.org/lkml/20201028171506.15682-1-ardb@xxxxxxxxxx/
> https://lore.kernel.org/lkml/20201028081123.GT2628@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> For this file, given that it is supposed to use -ftree-vectorize for the
> whole file anyway, is there any reason it's not just added to CFLAGS via
> the Makefile? This seems to be the only use of pragma optimize in the
> kernel.

Eg, this shows that the pragma results in dropping -fno-strict-aliasing.
https://godbolt.org/z/1nfrKT

The first function does not use vectorization because s and s->a might
alias.