Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: Linus Torvalds
Date: Wed Nov 15 2023 - 13:38:43 EST


On Wed, 15 Nov 2023 at 12:38, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> The gcc code generation here is *really* odd. I've never seen this
> before, so it may be new to newer versions of gcc. [...]

Ok, so I've been annoyed by gcc code generation for 'memcpy()' before,
mainly because it can't do the "let's inline it as an ALTERNATIVE()
between 'rep movs' and an out-of-line call" thing that we do for user
copies.
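For reference, the user copy side already does pretty much what I want
here. Going from memory of arch/x86/include/asm/uaccess_64.h (so treat
the details as approximate rather than gospel), copy_user_generic()
looks something like this:

  static __always_inline __must_check unsigned long
  copy_user_generic(void *to, const void *from, unsigned long len)
  {
	stac();
	/* FSRM CPUs run 'rep movsb' inline, others call out of line */
	asm volatile(
		"1:\n\t"
		ALTERNATIVE("rep movsb",
			    "call rep_movs_alternative",
			    ALT_NOT(X86_FEATURE_FSRM))
		"2:\n"
		_ASM_EXTABLE_UA(1b, 2b)
		:"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
		: : "memory", "rax");
	clac();
	return len;
  }

and the hack below is that same pattern, just without the user-access
window and the exception table entry.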

So here's a complete hack that says "we override the
__builtin_memcpy() implementation for non-constant lengths". We
still want gcc to deal with the constant size case, since often that
can be turned into just plain moves.
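IOW, with the wrapper macro in the patch, something like

	memcpy(&dst, &src, sizeof(dst));

should still be expanded by gcc into plain moves like it always was,
while a runtime-sized

	memcpy(dst, src, nbytes);

picks __rep_movs_memcpy() at compile time instead, using
__builtin_choose_expr() with the __is_constexpr() trick from
<linux/const.h> to tell the two cases apart.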

And this is *ENTIRELY* untested, except for checking that it makes the
code generation in lib/iov_iter.c look better.

It's probably completely broken. I particularly worry about "memcpy()"
being used *during* the instruction rewriting, so the whole approach
with ALTERNATIVE() is probably entirely broken.

But I'm cc'ing Borislav, because we've talked about the whole "inline
memcpy() with alternatives" thing before, so maybe this gives him some
ideas for how to do it right.

Borislav, see

https://lore.kernel.org/all/CAHk-=wjCUckvZUQf7gqp2ziJUWxVpikM_6srFdbcNdBJTxExRg@xxxxxxxxxxxxxx/

for some truly crazy code generation by gcc.

Linus

 arch/x86/include/asm/string_64.h | 22 ++++++++++++++++++++++
 include/linux/fortify-string.h   |  1 +
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 857d364b9888..c623f416789a 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -4,6 +4,10 @@
 
 #ifdef __KERNEL__
 #include <linux/jump_label.h>
+#include <linux/compiler.h>
+#include <linux/const.h>
+#include <asm/alternative.h>
+#include <asm/cpufeatures.h>
 
 /* Written 2002 by Andi Kleen */
 
@@ -18,6 +22,24 @@
 extern void *memcpy(void *to, const void *from, size_t len);
 extern void *__memcpy(void *to, const void *from, size_t len);
 
+extern __always_inline void *__rep_movs_memcpy(void *to, const void *from, size_t len)
+{
+	asm volatile(ALTERNATIVE(
+		"rep movsb",
+		"call rep_movs_alternative",
+		ALT_NOT(X86_FEATURE_FSRM))
+		:"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
+		: : "memory", "rax");
+	return to;
+}
+
+#define __builtin_memcpy(to, from, len)				\
+	__builtin_choose_expr(__is_constexpr(len),		\
+		__builtin_memcpy(to, from, len),		\
+		__rep_movs_memcpy(to, from, len))
+
+#define memcpy __builtin_memcpy
+
 #define __HAVE_ARCH_MEMSET
 void *memset(void *s, int c, size_t n);
 void *__memset(void *s, int c, size_t n);
diff --git a/include/linux/fortify-string.h b/include/linux/fortify-string.h
index da51a83b2829..cb778f68f537 100644
--- a/include/linux/fortify-string.h
+++ b/include/linux/fortify-string.h
@@ -690,6 +690,7 @@ __FORTIFY_INLINE bool fortify_memcpy_chk(__kernel_size_t size,
  * __struct_size() vs __member_size() must be captured here to avoid
  * evaluating argument side-effects further into the macro layers.
  */
+#undef memcpy
 #define memcpy(p, q, s) __fortify_memcpy_chk(p, q, s,		\
 		__struct_size(p), __struct_size(q),		\
 		__member_size(p), __member_size(q),		\