Re: [PATCH] x86: Alias memset to __builtin_memset.

From: Nathan Chancellor
Date: Mon Mar 23 2020 - 11:47:25 EST


On Mon, Mar 23, 2020 at 12:42:06PM +0100, 'Clement Courbet' via Clang Built Linux wrote:
> Recent compilers know the semantics of builtins (`memset`,
> `memcpy`, ...) and can replace calls with inline code when
> deemed profitable. For example, `memset(p, 0, 4)` will be
> lowered to a four-byte zero store.
>
> When using -ffreestanding (which is the case e.g. when building
> with clang), these optimizations are disabled. This means that
> **all** memsets, including those with small, constant sizes,
> result in an actual call to memset.
>
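> To make this concrete (a hypothetical snippet, not part of the
> patch), consider:
>
>     int x[4];
>     memset(x, 0, sizeof(x));  /* 16 bytes, size known at compile time */
>
> Without -ffreestanding, the compiler may inline this as two
> eight-byte zero stores; with -ffreestanding, it emits an actual
> call to memset.
>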
> We have identified several spots where we have high CPU usage
> because of this. For example, a single one of these memsets is
> responsible for about 0.3% of our total CPU usage in the kernel.
>
> Aliasing `memset` to `__builtin_memset` allows the compiler to
> perform this optimization even when -ffreestanding is used.
> There is no change when -ffreestanding is not used.
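>
> For illustration (this expansion is not part of the patch), with
> the alias in place a call site such as
>
>     memset(p, 0, 4);
>
> becomes
>
>     __builtin_memset(p, 0, 4);
>
> which the compiler is then free to inline, while it can still fall
> back to an actual call when inlining is not profitable.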
>
> Below is a diff (clang) for `update_sg_lb_stats()`, which
> includes the aforementioned hot memset:
> memset(sgs, 0, sizeof(*sgs));
>
> Diff:
> movq %rsi, %rbx ~~~> movq $0x0, 0x40(%r8)
> movq %rdi, %r15 ~~~> movq $0x0, 0x38(%r8)
> movl $0x48, %edx ~~~> movq $0x0, 0x30(%r8)
> movq %r8, %rdi ~~~> movq $0x0, 0x28(%r8)
> xorl %esi, %esi ~~~> movq $0x0, 0x20(%r8)
> callq <memset> ~~~> movq $0x0, 0x18(%r8)
> ~~~> movq $0x0, 0x10(%r8)
> ~~~> movq $0x0, 0x8(%r8)
> ~~~> movq $0x0, (%r8)
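>
> (The $0x48 = 72-byte memset is replaced by nine eight-byte zero
> stores.)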
>
> In terms of code size, this grows the clang-built kernel a
> bit (+0.022%):
> 440285512 vmlinux.clang.before
> 440383608 vmlinux.clang.after
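> (440383608 - 440285512 = 98096 bytes, i.e. 98096 / 440285512 ≈ +0.022%.)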
>
> Signed-off-by: Clement Courbet <courbet@xxxxxxxxxx>
> ---
> arch/x86/include/asm/string_64.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> index 75314c3dbe47..7073c25aa4a3 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -18,6 +18,15 @@ extern void *__memcpy(void *to, const void *from, size_t len);
> void *memset(void *s, int c, size_t n);
> void *__memset(void *s, int c, size_t n);
>
> +/* Recent compilers can generate much better code for known sizes and/or
> + * fill values, and will fall back to a call to `memset` if they fail.
> + * We explicitly alias `memset` to `__builtin_memset` so that the compiler
> + * can perform this optimization even when -ffreestanding is used.
> + */
> +#if (__GNUC__ >= 4)

This ifdef is unnecessary; it will always be true because the minimum
supported GCC version is 4.6 and clang pretends to be GCC 4.2.1. It
appears the Intel compiler pretends to be whatever GCC version the
host is using (no idea whether it is still used by anyone or truly
supported, but it is worth mentioning), so the check would still
always be true because of the GCC 4.6 requirement.
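
For reference, a quick way to check the version macros (an
illustrative program, not from the thread):

    /* Prints "4.2.1" when compiled with clang. */
    #include <stdio.h>

    int main(void)
    {
            printf("%d.%d.%d\n",
                   __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
            return 0;
    }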

I cannot comment on the validity of the patch itself, although the
reasoning seems sound to me.

Cheers,
Nathan

> +#define memset(s, c, count) __builtin_memset(s, c, count)
> +#endif
> +
> #define __HAVE_ARCH_MEMSET16
> static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
> {
> @@ -74,6 +83,7 @@ int strcmp(const char *cs, const char *ct);
> #undef memcpy
> #define memcpy(dst, src, len) __memcpy(dst, src, len)
> #define memmove(dst, src, len) __memmove(dst, src, len)
> +#undef memset
> #define memset(s, c, n) __memset(s, c, n)
>
> #ifndef __NO_FORTIFY
> --
> 2.25.1.696.g5e7596f4ac-goog
>