Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: Linus Torvalds
Date: Wed Nov 15 2023 - 12:39:37 EST


On Wed, 15 Nov 2023 at 11:53, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> I wonder if gcc somehow decided to inline "memcpy()" in
> memcpy_from_iter() as a "rep movsb" because of other inlining changes?
>
> [ Goes out to look ]
>
> Yup, I think that's exactly what happened. Gcc seems to decide that it
> might be a small memcpy(), and seems to do at least part of it
> directly.
>
> So I *think* this all is mainly an artifact of gcc having changed code
> generation due to the code re-organization.

The gcc code generation here is *really* odd. I've never seen this
before, so it may be something new in recent versions of gcc. I see
code like this:

# lib/iov_iter.c:73: memcpy(to + progress, iter_from, len);
cmpl $8, %edx #, _88
jb .L400 #,
movq (%rsi), %rax #, tmp288
movq %rax, (%rcx) # tmp288,
movl %edx, %eax # _88, _88
movq -8(%rsi,%rax), %rdi #, tmp295
movq %rdi, -8(%rcx,%rax) # tmp295,
leaq 8(%rcx), %rdi #, tmp296
andq $-8, %rdi #, tmp296
subq %rdi, %rcx # tmp296, tmp268
subq %rcx, %rsi # tmp268, tmp269
addl %edx, %ecx # _88, _88
shrl $3, %ecx #,
rep movsq
jmp .L392 #

.L398:
# lib/iov_iter.c:73: memcpy(to + progress, iter_from, len);
movl (%rsi), %eax #, tmp271
movl %eax, (%rcx) # tmp271,
movl %edx, %eax # _88, _88
movl -4(%rsi,%rax), %esi #, tmp278
movl %esi, -4(%rcx,%rax) # tmp278,
movl 8(%r9), %edi # p_72->bv_len, p_72->bv_len
jmp .L330 #
...

.L400:
# lib/iov_iter.c:73: memcpy(to + progress, iter_from, len);
testb $4, %dl #, _88
jne .L398 #,
testl %edx, %edx # _88
je .L330 #,
movzbl (%rsi), %eax #, tmp279
movb %al, (%rcx) # tmp279,
testb $2, %dl #, _88
jne .L390 #,
...

which makes *zero* sense. It first checks that the length is at
least 8 bytes, then it moves the first and last words by hand, then
it aligns the destination to 8 bytes, and does the remaining (possibly
overlapping) words as one "rep movsq".

And .L398 is the "I have 4..7 bytes to copy" target.

And .L400 seems to be "I have 0..7 bytes to copy".
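
For readers who do not want to decode the assembly, here is a rough C
model of what the listing appears to do. It is a sketch, not gcc's
actual expansion logic: every function name is made up for
illustration, the "rep movsq" is modelled as a plain loop, and the
0..3 byte case is a simple byte loop because that part of the listing
is elided.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* len >= 8: the path that falls through "cmpl $8, %edx; jb .L400". */
static void copy_ge8_sketch(unsigned char *dst, const unsigned char *src,
			    size_t len)
{
	uint64_t w;
	unsigned char *aligned;
	size_t skip, words, i;

	assert(len >= 8);

	/* First 8 bytes, then last 8 bytes, each copied as one word. */
	memcpy(&w, src, 8);
	memcpy(dst, &w, 8);
	memcpy(&w, src + len - 8, 8);
	memcpy(dst + len - 8, &w, 8);

	/* leaq 8(%rcx), %rdi; andq $-8, %rdi: next 8-byte boundary. */
	aligned = (unsigned char *)(((uintptr_t)dst + 8) & ~(uintptr_t)7);
	skip = (size_t)(aligned - dst);	/* 1..8, covered by the first word */
	words = (len - skip) >> 3;	/* shrl $3, %ecx */

	/* Stand-in for "rep movsq": whole words to the aligned address. */
	for (i = 0; i < words; i++) {
		memcpy(&w, src + skip + 8 * i, 8);
		memcpy(aligned + 8 * i, &w, 8);
	}
}

/* 4..7 bytes: the .L398 path, two possibly overlapping 32-bit moves. */
static void copy_4to7_sketch(unsigned char *dst, const unsigned char *src,
			     size_t len)
{
	uint32_t w;

	memcpy(&w, src, 4);
	memcpy(dst, &w, 4);
	memcpy(&w, src + len - 4, 4);
	memcpy(dst + len - 4, &w, 4);
}

/* Hypothetical entry point that dispatches the way the listing does. */
void gcc_style_copy_sketch(void *to, const void *from, size_t len)
{
	unsigned char *dst = to;
	const unsigned char *src = from;

	if (len >= 8) {
		copy_ge8_sketch(dst, src, len);
	} else if (len & 4) {		/* testb $4, %dl; jne .L398 */
		copy_4to7_sketch(dst, src, len);
	} else {
		/* 0..3 bytes: elided in the listing, so just a byte loop. */
		while (len--)
			*dst++ = *src++;
	}
}
```

The first-word and last-word stores cover whatever the aligned middle
loop does not reach, which is why the copies may overlap.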

This is literally insane. And it seems to be all just gcc having for
some reason decided to do this instead of "rep movsb" or calling an
out-of-line function.

I get the feeling that this is related to how your patches made that
function be an inline function that is inlined through a function
pointer. I suspect that what happens is that gcc expands the memcpy()
first into that inlined function (without caller context), and then
inserts the crazily expanded inline later into the context of that
function pointer.
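
To make "inlined through a function pointer" concrete, here is a
minimal sketch of that pattern. All types and names are hypothetical
and much simpler than the real lib/iov_iter.c code, and the
always_inline attribute is a gcc/clang extension. It only shows the
shape of the code: a step function whose memcpy() has a run-time
length, handed as a constant function pointer to a generic walker
that the compiler can then inline it into. It makes no claim about
what code any particular gcc version generates for it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical segment type; stands in for a bio_vec-like entry. */
struct seg_sketch {
	const void *base;
	size_t len;
};

/* Step callback: returns the number of bytes it could NOT process. */
typedef size_t (*step_fn_sketch)(const void *iter_from, size_t progress,
				 size_t len, void *to);

/* The step: its memcpy() length is only known at run time. */
static inline size_t copy_step_sketch(const void *iter_from,
				      size_t progress, size_t len, void *to)
{
	memcpy((char *)to + progress, iter_from, len);
	return 0;
}

/* Generic walker; it only sees the step as a function pointer. */
static inline __attribute__((always_inline))
size_t walk_segs_sketch(const struct seg_sketch *segs, size_t nsegs,
			size_t bytes, void *to, step_fn_sketch step)
{
	size_t progress = 0;

	assert(step != NULL);

	for (size_t i = 0; i < nsegs && progress < bytes; i++) {
		size_t part = segs[i].len;
		size_t left;

		if (part > bytes - progress)
			part = bytes - progress;
		left = step(segs[i].base, progress, part, to);
		progress += part - left;
		if (left)
			break;
	}
	return progress;
}

/* The pointer is a compile-time constant here, so the step can be
 * inlined into the walker through it. */
size_t copy_from_segs_sketch(const struct seg_sketch *segs, size_t nsegs,
			     size_t bytes, void *to)
{
	return walk_segs_sketch(segs, nsegs, bytes, to, copy_step_sketch);
}
```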

I dunno. I really only say that because I haven't seen gcc make this
kind of mess before, and that "inlined through a function pointer" is
the main unusual thing here.

How very annoying.

Linus