Re: [PATCH v3] x86, mem: move memmove to out of line assembler

From: Nick Desaulniers
Date: Wed Sep 28 2022 - 15:07:01 EST


On Wed, Sep 28, 2022 at 12:24 AM Rasmus Villemoes
<linux@xxxxxxxxxxxxxxxxxx> wrote:
>
> On 27/09/2022 23.02, Nick Desaulniers wrote:
>
> > + /* Decide forward/backward copy mode */
> > + cmpl dest, src
> > + jb .Lbackwards_header
>
> I know you're mostly just moving existing code, but for my own education
> I'd like to understand this.
>
> > + /*
> > + * movs instruction have many startup latency
> > + * so we handle small size by general register.
> > + */
> > + cmpl $680, n
> > + jb .Ltoo_small_forwards
>
> OK, this I get, there's some overhead, and hence we need _some_ cutoff
> value; 680 is probably chosen by some trial-and-error, but the exact
> value likely doesn't matter too much.

__memmove in arch/x86/lib/memmove_64.S uses the same 680 cutoff.
But I assume that startup latency is precisely why FSRM was created.
https://www.phoronix.com/news/Intel-5.6-FSRM-Memmove
commit f444a5ff95dc ("x86/cpufeatures: Add support for fast short REP; MOVSB")
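
For my own notes, the forward-path dispatch reads roughly like this in
C (the helper name is mine, not something in the patch):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  static bool use_forward_movs(uintptr_t src, uintptr_t dest, size_t n)
  {
          /* Large enough to amortize the startup latency of rep movs... */
          if (n < 680)
                  return false;
          /* ...and src/dest mutually aligned: equal low address bytes. */
          return ((src ^ dest) & 0xff) == 0;
  }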

>
> > + /*
> > + * movs instruction is only good for aligned case.
> > + */
> > + movl src, tmp0
> > + xorl dest, tmp0
> > + andl $0xff, tmp0
> > + jz .Lforward_movs
>
> But this part I don't understand at all. This checks that src and
> dest have the same value mod 256, which is a rather odd condition, and
> very unlikely to ever be hit in practice. I could understand if it
> checked that they were both 4-, 8-, or 16-byte aligned (i.e.,
> (src|dest) & FOO), or if it checked that they had the same offset
> within a cacheline [say (src ^ dest) & 0x3f].
>
> Any idea where that comes from? Or am I just incapable of reading x86 asm?

So I think the above is roughly:

  if (((src ^ dest) & 0xff) == 0)
          goto .Lforward_movs;

So if src and dest don't have the same bottom byte, don't use movs?
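
If I'm reading the predicate right, it's strictly stronger than the
cacheline check you suggest: equal low bytes imply the same offset
within a dword and within a cacheline, though not that either pointer
is absolutely aligned. A sketch of the implication (helper name is
mine):

  #include <assert.h>
  #include <stdint.h>

  static void check_mutual_alignment(uintptr_t src, uintptr_t dest)
  {
          if (((src ^ dest) & 0xff) == 0) {
                  assert((src & 0x3)  == (dest & 0x3));  /* dword offset */
                  assert((src & 0x3f) == (dest & 0x3f)); /* cacheline offset */
          }
  }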

>
> > +.Ltoo_small_forwards:
> > + subl $0x10, n
> > +
> > + /*
> > + * We gobble 16 bytes forward in each loop.
> > + */
> > +.L16_byteswap_forwards_loop:
> > + subl $0x10, n
> > + movl 0*4(src), tmp0
> > + movl 1*4(src), tmp1
> > + movl tmp0, 0*4(dest)
> > + movl tmp1, 1*4(dest)
> > + movl 2*4(src), tmp0
> > + movl 3*4(src), tmp1
> > + movl tmp0, 2*4(dest)
> > + movl tmp1, 3*4(dest)
> > + leal 0x10(src), src
> > + leal 0x10(dest), dest
> > + jae .L16_byteswap_forwards_loop
> > + addl $0x10, n
> > + jmp .L16_byteswap
> > +
> > + /*
> > + * Handle data forward by movs.
> > + */
> > +.p2align 4
> > +.Lforward_movs:
> > + movl -4(src, n), tmp0
> > + leal -4(dest, n), tmp1
> > + shrl $2, n
> > + rep movsl
> > + movl tmp0, (tmp1)
> > + jmp .Ldone
>
> So in the original code, %1 was forced to be %esi and %2 was forced to
> be %edi and they were initialized by src and dest. But here I fail to
> see how those registers have been properly set up before the rep movs;
> your names for those are tmp0 and tmp2. You have just loaded the last
> word of the source to %edi, and AFAICT %esi aka tmp2 is entirely
> uninitialized at this point (the only use is in L16_byteswap).
>
> I must be missing something. Please enlighten me.

No, you're right. It looks like rep movsl requires src in %esi and dest
in %edi, so I can't reuse the input registers from -mregparm=3; a pair
of movs is needed to set them up. A v4 is required.
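
For anyone following along, here's an approximate C model of what rep
movsl does with the direction flag clear; the register roles are fixed
by the ISA, which is why the explicit setup is unavoidable:

  #include <stdint.h>

  static void rep_movsl_model(uint32_t *edi /* dest */,
                              const uint32_t *esi /* src */,
                              uint32_t ecx /* dword count */)
  {
          while (ecx--)
                  *edi++ = *esi++;
  }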

Probably should write a test for memmove where n is greater than the
magic constant 680.
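
Something shaped like this userspace sketch (buffer name and fill
pattern are arbitrary, and it exercises libc's memmove rather than the
kernel's; an in-tree test would go through the kernel's own test
infrastructure):

  #include <assert.h>
  #include <stddef.h>
  #include <string.h>

  int main(void)
  {
          static unsigned char buf[2048];

          /* Pattern with period != 256 so a misdirected copy is caught. */
          for (size_t i = 0; i < sizeof(buf); i++)
                  buf[i] = i % 251;
          /* src > dest, (src ^ dest) & 0xff == 0, and n > 680: an
           * overlapping forward copy eligible for the rep movsl path. */
          memmove(buf, buf + 256, 1024);
          for (size_t i = 0; i < 1024; i++)
                  assert(buf[i] == (unsigned char)((i + 256) % 251));
          return 0;
  }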
--
Thanks,
~Nick Desaulniers