RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: David Laight
Date: Thu Nov 16 2023 - 11:56:01 EST


From: Linus Torvalds
> Sent: 16 November 2023 03:27
>
> On Wed, 15 Nov 2023 at 18:00, David Howells <dhowells@xxxxxxxxxx> wrote:
...
> > A disassembly of _copy_from_iter() for the latter is attached. Note that the
> > UBUF/IOVEC still uses "rep movsb"
>
> Well, yes and no.
>
> User copies do that X86_FEATURE_FSRM alternatives dance, so the code
> gets generated with "rep movs", but you'll note that there are several
> 'nops' after it.
>
> Some of the nops are because we'll be inserting STAC/CLAC (three bytes
> each, I think) instructions around user accesses for SMAP-capable
> CPU's.
>
> But some of the nops are because we'll be rewriting that "rep stosb"
> (two bytes, iirc) as "call rep_stos_alternative" (5 bytes) on CPU's
> that don't do FSRM like yours. So your CPU won't actually be executing
> that 'rep stosb' sequence.

I presume lack of coffee is responsible for the s/movs/stos/ :-)

How much difference does FSRM actually make?
Especially when compared to the cost of a function call (even
without the horrid return thunk).

For small %cx I think modern non-FSRM CPUs manage ~2 clocks/byte
(no fixed overhead), which means 'rep movsb' wins for both short
and long copies.
I wonder at what sizes the function call (with all its size-based
compares at the top) is actually a win.
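
One rough way to get a feel for the crossover is a userspace timing
loop. The sketch below is only an approximation: it compares an inlined
'rep movsb' against an out-of-line call to the C library's memcpy(),
not the kernel's rep_movs_alternative, and there is no return thunk
involved. All helper names are made up for illustration, and it assumes
GCC or Clang (inline asm, POSIX clock_gettime()).

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Inlined copy: 'rep movsb' on x86, a plain byte loop elsewhere. */
static inline __attribute__((always_inline))
void rep_movsb(void *dst, const void *src, size_t len)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(len)
                     :
                     : "memory");
#else
    volatile unsigned char *d = dst;
    const unsigned char *s = src;
    while (len--)
        *d++ = *s++;
#endif
}

/* Out-of-line copy so that every iteration pays for a real call. */
static __attribute__((noinline))
void copy_via_call(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

static uint64_t elapsed_ns(const struct timespec *t0, const struct timespec *t1)
{
    int64_t ns = (int64_t)(t1->tv_sec - t0->tv_sec) * 1000000000LL
               + (t1->tv_nsec - t0->tv_nsec);
    return (uint64_t)ns;
}

/* Total time for 'iters' inlined copies of 'len' bytes. */
static uint64_t time_inline_rep_movsb(void *dst, const void *src,
                                      size_t len, unsigned iters)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned i = 0; i < iters; i++)
        rep_movsb(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(&t0, &t1);
}

/* Total time for 'iters' called copies of 'len' bytes. */
static uint64_t time_called_copy(void *dst, const void *src,
                                 size_t len, unsigned iters)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned i = 0; i < iters; i++)
        copy_via_call(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(&t0, &t1);
}
```

Running both over a range of lengths (say 1 to 4096 bytes) and dividing
by the iteration count gives a per-copy cost to compare; the numbers
will depend heavily on the CPU and on the memcpy() implementation.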

There has to be some mileage in getting the compiler to generate
'call memcpy' (for non-constant sizes) and then run-time patching
the 5 byte 'call offset' into 'mov %edx,%ecx; rep movsb'.
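
The byte arithmetic works out: 'mov %edx,%ecx' is 2 bytes and
'rep movsb' is 2 bytes, leaving one byte of nop padding in the 5 byte
call slot. A minimal sketch of that rewrite on a plain byte buffer
(userspace C, not the kernel's alternatives machinery; the helper name
is made up):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CALL_REL32_LEN 5	/* 0xe8 opcode + 32-bit displacement */

/*
 * Hypothetical helper: overwrite one 5 byte 'call rel32' with an
 * inline copy sequence. Returns 0 on success, -1 if the site does
 * not start with the 'call rel32' opcode.
 */
static int patch_call_to_rep_movsb(uint8_t *site)
{
    static const uint8_t inline_copy[CALL_REL32_LEN] = {
        0x89, 0xd1,	/* mov %edx,%ecx */
        0xf3, 0xa4,	/* rep movsb     */
        0x90,		/* nop (padding) */
    };

    if (site[0] != 0xe8)
        return -1;

    memcpy(site, inline_copy, sizeof(inline_copy));
    return 0;
}
```

A real version would also have to cope with memcpy() returning the
destination in %rax, which this sequence does not set, and the 32-bit
'mov' assumes the length fits in 32 bits.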

David
