Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: Linus Torvalds
Date: Wed Nov 15 2023 - 14:10:04 EST


On Wed, 15 Nov 2023 at 13:45, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Do you perhaps have CONFIG_CC_OPTIMIZE_FOR_SIZE set? That makes gcc
> use "rep movsb" - even for small copies that most definitely should
> *not* use "rep movsb".

Just to give some background and an example:

    __builtin_memcpy(dst, src, 24);

with -O2 is done as three 64-bit move instructions (well, three in
each direction, so six instructions total), and with -Os you get

    movl $6, %ecx
    rep movsl

instead. And no, this isn't all that uncommon, because things like
the above are what happen when you copy a small structure around.

And that "rep movsl" is indeed nice and small, but it's truly
horrendously bad from a performance angle on most cores, compared to
the six instructions that can schedule nicely and take a cycle or two.
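To see this for yourself, something like the following can be compiled
with "gcc -O2 -S" and then "gcc -Os -S" and the two outputs compared.
This is only a minimal sketch: 'struct small' and 'copy_small()' are
made-up names for illustration, not anything from the kernel, and the
exact instructions you get will depend on compiler version and target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 24-byte structure: three 64-bit fields. */
struct small {
	uint64_t a, b, c;
};
static_assert(sizeof(struct small) == 24, "struct small should be 24 bytes");

/*
 * A fixed-size 24-byte copy, the same shape as the
 * __builtin_memcpy(dst, src, 24) case above. This is the kind of
 * thing that also happens implicitly on a plain "*dst = *src".
 */
void copy_small(struct small *dst, const struct small *src)
{
	__builtin_memcpy(dst, src, sizeof(*dst));
}
```

The behavior is identical either way. Only the generated code differs,
which is why this kind of thing goes unnoticed until somebody measures it.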

There are some other cases of similar "-Os generates unacceptable
code". For example, dividing by a constant - when you use -Os, gcc
thinks that it's perfectly fine to actually generate a divide
instruction, because it is indeed small.

But in most cases you really *really* want to use a "multiply by
reciprocal" even though it generates bigger code. Again, it ends up
depending on microarchitecture, and modern cores tend to do better on
divides, but it's another of those things where saving a couple of
bytes of code space is not the right choice if it means that you use a
slow divider.

And again, those "divide by constant" cases often happen in implicit
contexts (ie the constant may be the size of a structure, and the
divide is due to taking a pointer difference). Let's say you have a
structure whose size isn't a power of two, but is (to pick a random
but not unlikely value) 56 bytes.

The code generation for -O2 is (value in %rdi)

    movabsq $2635249153387078803, %rax
    shrq $3, %rdi
    mulq %rdi

and for -Os you get (value in %rax):

    movl $56, %ecx
    xorl %edx, %edx
    divq %rcx

and that 'divq' is certainly again smaller and more obvious, but again
we're talking "single cycles" vs "potentially 50+ cycles" depending on
uarch.
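For anybody who wants to see what that -O2 sequence is actually doing,
here is a rough C rendering of it. Since 56 = 8 * 7, the value is
shifted right by 3 and then multiplied by ceil(2^64 / 7), which is the
2635249153387078803 constant above. The quotient is the high 64 bits of
the 128-bit product, which is what 'mulq' leaves in %rdx. Treat this as
a sketch under assumptions: 'struct item' and the function names are
invented for illustration, 'unsigned __int128' is a gcc/clang extension
on 64-bit targets, and a compiler is free to pick a different sequence
for the pointer-difference case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure, sized at 56 bytes for illustration. */
struct item {
	uint64_t words[7];
};
static_assert(sizeof(struct item) == 56, "struct item should be 56 bytes");

/*
 * Implicit divide: the byte distance gets divided by
 * sizeof(struct item) to produce an element count, even though
 * no '/' appears anywhere in the source.
 */
ptrdiff_t item_index(const struct item *base, const struct item *p)
{
	return p - base;
}

/* The plain form: the compiler chooses between a divide and a multiply. */
uint64_t div56_plain(uint64_t x)
{
	return x / 56;
}

/*
 * "Multiply by reciprocal", spelled out by hand:
 * shift out the factor of 8, multiply by ceil(2^64 / 7),
 * and keep the high 64 bits of the 128-bit product.
 */
uint64_t div56_recip(uint64_t x)
{
	const uint64_t recip7 = 2635249153387078803ULL; /* 0x2492492492492493 */
	unsigned __int128 prod = (unsigned __int128)(x >> 3) * recip7;

	return (uint64_t)(prod >> 64);
}
```

You would never write div56_recip() by hand in real code, since the
compiler does that transformation for you at -O2. The point is just
that the bigger sequence is a shift and a multiply, not a divide.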

Linus