[PATCH] x86: Alias memset to __builtin_memset.

From: Clement Courbet
Date: Mon Mar 23 2020 - 07:44:35 EST


Recent compilers know the meaning of builtins (`memset`,
`memcpy`, ...) and can replace calls by inline code when
deemed better. For example, `memset(p, 0, 4)` will be lowered
to a four-byte zero store.

When using -ffreestanding (this is the case e.g. building on
clang), these optimizations are disabled. This means that **all**
memsets, including those with small, constant sizes, will result
in an actual call to memset.

We have identified several spots where we have high CPU usage
because of this. For example, a single one of these memsets is
responsible for about 0.3% of our total CPU usage in the kernel.

Aliasing `memset` to `__builtin_memset` allows the compiler to
perform this optimization even when -ffreestanding is used.
There is no change when -ffreestanding is not used.

Below is a diff (clang) for `update_sg_lb_stats()`, which
includes the aforementionned hot memset:
memset(sgs, 0, sizeof(*sgs));

Diff:
movq %rsi, %rbx ~~~> movq $0x0, 0x40(%r8)
movq %rdi, %r15 ~~~> movq $0x0, 0x38(%r8)
movl $0x48, %edx ~~~> movq $0x0, 0x30(%r8)
movq %r8, %rdi ~~~> movq $0x0, 0x28(%r8)
xorl %esi, %esi ~~~> movq $0x0, 0x20(%r8)
callq <memset> ~~~> movq $0x0, 0x18(%r8)
~~~> movq $0x0, 0x10(%r8)
~~~> movq $0x0, 0x8(%r8)
~~~> movq $0x0, (%r8)

In terms of code size, this grows the clang-built kernel a
bit (+0.022%):
440285512 vmlinux.clang.after
440383608 vmlinux.clang.before

Signed-off-by: Clement Courbet <courbet@xxxxxxxxxx>
---
arch/x86/include/asm/string_64.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 75314c3dbe47..7073c25aa4a3 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -18,6 +18,15 @@ extern void *__memcpy(void *to, const void *from, size_t len);
void *memset(void *s, int c, size_t n);
void *__memset(void *s, int c, size_t n);

+/* Recent compilers can generate much better code for known size and/or
+ * fill values, and will fallback on `memset` if they fail.
+ * We alias `memset` to `__builtin_memset` explicitly to inform the compiler to
+ * perform this optimization even when -ffreestanding is used.
+ */
+#if (__GNUC__ >= 4)
+#define memset(s, c, count) __builtin_memset(s, c, count)
+#endif
+
#define __HAVE_ARCH_MEMSET16
static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
{
@@ -74,6 +83,7 @@ int strcmp(const char *cs, const char *ct);
#undef memcpy
#define memcpy(dst, src, len) __memcpy(dst, src, len)
#define memmove(dst, src, len) __memmove(dst, src, len)
+#undef memset
#define memset(s, c, n) __memset(s, c, n)

#ifndef __NO_FORTIFY
--
2.25.1.696.g5e7596f4ac-goog