Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization

From: Adrian Ratiu
Date: Mon Nov 09 2020 - 14:51:56 EST


On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulniers@xxxxxxxxxx> wrote:
On Fri, Nov 6, 2020 at 3:50 AM Adrian Ratiu <adrian.ratiu@xxxxxxxxxxxxx> wrote:

Hi Nathan,
On Fri, 06 Nov 2020, Nathan Chancellor <natechancellor@xxxxxxxxx> wrote:
> + Ard, who wrote this code.
>
> On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
>> Due to a Clang bug [1] neon autoloop vectorization does not
>> happen or happens badly with no gains and considering
>> previous GCC experiences which generated unoptimized code
>> which was worse than the default asm implementation, it is
>> safer to default clang builds to the known good generic
>> implementation. The kernel currently supports a minimum
>> Clang version of v10.0.1, see commit 1f7a44f63e6c
>> ("compiler-clang: add build check for clang 10.0.1"). When
>> the bug gets eventually fixed, this commit could be reverted
>> or, if the minimum clang version bump takes a long time, a
>> warning could be added for users to upgrade their compilers
>> like was done for GCC.
>>
>> [1] https://bugs.llvm.org/show_bug.cgi?id=40976
>>
>> Signed-off-by: Adrian Ratiu <adrian.ratiu@xxxxxxxxxxxxx>
>
> Thank you for the patch! We are also tracking this here:
>
> https://github.com/ClangBuiltLinux/linux/issues/496
>
> It was on my TODO to revisit getting the warning eliminated,
> which likely would have involved a patch like this as well.
>
> I am curious if it is worth revisiting or dusting off Arnd's
> patch in the LLVM bug tracker first. I have not tried it
> personally. If that is not a worthwhile option, I am fine
> with this for now. It would be nice to try and get a fix
> pinned down on the LLVM side at some point but alas, finite
> amount of resources and people :(

I tested Arnd's kernel patch from the LLVM bugtracker [1], but with Clang v10.0.1 I still get warnings like the following, even though the __restrict workaround seems to affect the generated instructions:
./include/asm-generic/xor.h:15:2: remark: the cost-model indicates that interleaving is not beneficial [-Rpass-missed=loop-vectorize]
./include/asm-generic/xor.h:11:1: remark: List vectorization was possible but not beneficial with cost 0 >= 0 [-Rpass-missed=slp-vectorizer]
xor_8regs_2(unsigned long bytes, unsigned long *__restrict p1, unsigned long *__restrict p2)

If it's just a matter of overruling the cost model

#pragma clang loop vectorize(enable)

will do the trick.
Indeed,
```
diff --git a/include/asm-generic/xor.h b/include/asm-generic/xor.h
index b62a2a56a4d4..8796955498b7 100644
--- a/include/asm-generic/xor.h
+++ b/include/asm-generic/xor.h
@@ -12,6 +12,7 @@ xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
 {
 	long lines = bytes / (sizeof (long)) / 8;
 
+#pragma clang loop vectorize(enable)
 	do {
 		p1[0] ^= p2[0];
 		p1[1] ^= p2[1];
@@ -32,6 +33,7 @@ xor_8regs_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
 {
 	long lines = bytes / (sizeof (long)) / 8;
 
+#pragma clang loop vectorize(enable)
 	do {
 		p1[0] ^= p2[0] ^ p3[0];
 		p1[1] ^= p2[1] ^ p3[1];
@@ -53,6 +55,7 @@ xor_8regs_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
 {
 	long lines = bytes / (sizeof (long)) / 8;
 
+#pragma clang loop vectorize(enable)
 	do {
 		p1[0] ^= p2[0] ^ p3[0] ^ p4[0];
 		p1[1] ^= p2[1] ^ p3[1] ^ p4[1];
@@ -75,6 +78,7 @@ xor_8regs_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
 {
 	long lines = bytes / (sizeof (long)) / 8;
 
+#pragma clang loop vectorize(enable)
 	do {
 		p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0];
 		p1[1] ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
```
seems to generate the vectorized code.

Why don't we find a way to make those pragmas more toolchain portable, rather than open coding them like I have above, instead of this series?
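
As a rough illustration of what "more toolchain portable" could mean here, the hint can be wrapped in a macro via the C99 `_Pragma()` operator so each compiler gets its own spelling in one place. This is only a userspace sketch: `XOR_LOOP_VECTORIZE` and `xor_8regs_2_hinted()` are hypothetical names, not existing kernel symbols, and the non-Clang branch is deliberately left empty because this thread does not settle on an equivalent per-loop hint for GCC.

```c
#include <assert.h>
#include <stddef.h>

/*
 * XOR_LOOP_VECTORIZE is a hypothetical name, not an existing kernel macro.
 * _Pragma() lets a macro expand to a pragma, so the hint is defined once
 * per toolchain instead of being open coded in front of every loop.
 */
#if defined(__clang__)
#define XOR_LOOP_VECTORIZE _Pragma("clang loop vectorize(enable)")
#else
#define XOR_LOOP_VECTORIZE /* assumption: no hint for other compilers */
#endif

/* Userspace copy of the loop shape of xor_8regs_2() in asm-generic/xor.h. */
static void xor_8regs_2_hinted(unsigned long bytes, unsigned long *p1,
			       unsigned long *p2)
{
	long lines = bytes / (sizeof(long)) / 8;

	/* The do/while below needs at least one full line of 8 words. */
	assert(lines > 0 && bytes % (8 * sizeof(long)) == 0);

	XOR_LOOP_VECTORIZE
	do {
		p1[0] ^= p2[0];
		p1[1] ^= p2[1];
		p1[2] ^= p2[2];
		p1[3] ^= p2[3];
		p1[4] ^= p2[4];
		p1[5] ^= p2[5];
		p1[6] ^= p2[6];
		p1[7] ^= p2[7];
		p1 += 8;
		p2 += 8;
	} while (--lines > 0);
}
```

The macro only moves the hint to one place; it says nothing about whether the code Clang then emits is correct or faster.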

Hi again Nick,

How did you verify the above pragmas generate correct vectorized code? Have you tested this specific use case?

I'm asking because overruling the cost model might not be enough. The only thing I can confirm is that the generated code changes, not that it is correct in any way. The object disassembly also looks weird, but I don't have enough knowledge to start debugging what's happening within LLVM/Clang itself.
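
For illustration, a functional check of the kind being asked about could compare the hinted loop against a plain word-by-word XOR. This is a minimal userspace sketch under stated assumptions: `xor_under_test()`, `xor_matches_reference()` and `NWORDS` are hypothetical names, the loop is a copy of the shape of `xor_8regs_2()`, and the input values are arbitrary. It checks results only on whatever host it is built for; it does not show that NEON instructions are emitted or that an ARM kernel build behaves the same.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NWORDS 64 /* arbitrary test size, a multiple of 8 words */

/* Copy of the xor_8regs_2() loop shape with the pragma under discussion. */
static void xor_under_test(unsigned long bytes, unsigned long *p1,
			   unsigned long *p2)
{
	long lines = bytes / (sizeof(long)) / 8;

	/* The do/while below needs at least one full line of 8 words. */
	assert(lines > 0 && bytes % (8 * sizeof(long)) == 0);

#if defined(__clang__)
#pragma clang loop vectorize(enable)
#endif
	do {
		p1[0] ^= p2[0];
		p1[1] ^= p2[1];
		p1[2] ^= p2[2];
		p1[3] ^= p2[3];
		p1[4] ^= p2[4];
		p1[5] ^= p2[5];
		p1[6] ^= p2[6];
		p1[7] ^= p2[7];
		p1 += 8;
		p2 += 8;
	} while (--lines > 0);
}

/* Returns 1 if the loop above matches a word-by-word reference XOR. */
static int xor_matches_reference(void)
{
	unsigned long dst[NWORDS], src[NWORDS], expect[NWORDS];
	size_t i;

	for (i = 0; i < NWORDS; i++) {
		dst[i] = 0x9e3779b9UL * (i + 1);    /* arbitrary pattern */
		src[i] = ~(0x85ebca6bUL * (i + 3)); /* arbitrary pattern */
		expect[i] = dst[i] ^ src[i];        /* reference result */
	}

	xor_under_test(sizeof(dst), dst, src);
	return memcmp(dst, expect, sizeof(dst)) == 0;
}
```

A check like this covers functional correctness only; whether the vectorized code is actually faster than the default asm implementation would still need the kernel's own xor benchmarking on real hardware.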

I also get some new warnings with your code [1], besides the previous 'vectorization was possible but not beneficial' remark, which is still present. It is quite funny because these two warnings seem to contradict each other. :)

At this point I do not trust the compiler and am inclined to do what was done for GCC when it was broken: disable the optimization and warn users to upgrade once the compiler is fixed and confirmed to work.

If you agree I can send a v2 with this and also drop the GCC pragma as Arvind and Ard suggested.

Kind regards,
Adrian

[1]
./include/asm-generic/xor.h:11:1: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)


--
Thanks,
~Nick Desaulniers