Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: Linus Torvalds
Date: Wed Nov 15 2023 - 22:26:59 EST


On Wed, 15 Nov 2023 at 18:00, David Howells <dhowells@xxxxxxxxxx> wrote:
>
> And using __memcpy() rather than memcpy():

Yeah, that's just sad. It might indeed be that you're running on a
Haswell core, and the retpoline overhead just kills that entirely. You
could try building the kernel without mitigations (or booting with
"mitigations=off", which isn't quite as good) to verify.

> A disassembly of _copy_from_iter() for the latter is attached. Note that the
> UBUF/IOVEC still uses "rep movsb"

Well, yes and no.

User copies do that X86_FEATURE_FSRM alternatives dance, so the code
gets generated with "rep movs", but you'll note that there are several
'nops' after it.

Some of the nops are because we'll be inserting STAC/CLAC (three bytes
each, I think) instructions around user accesses for SMAP-capable
CPUs.

But some of the nops are because we'll be rewriting that "rep movsb"
(two bytes, iirc) as "call rep_movs_alternative" (5 bytes) on CPUs
that don't do FSRM, like yours. So your CPU won't actually be executing
that 'rep movsb' sequence.
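
To make the nop padding concrete, here is a rough user-space sketch of
what that kind of patching amounts to. It is not the kernel's
alternatives code: patch_site() and its parameters are made up for
illustration, and it only rewrites bytes in an ordinary buffer. The
opcode bytes (F3 A4 for the rep-prefixed byte move, E8 plus a 32-bit
displacement for a near call, 90 for nop) are the standard x86
encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { OP_NOP = 0x90, OP_CALL_REL32 = 0xE8, CALL_LEN = 5 };

/*
 * Hypothetical helper, not a kernel API.
 *
 * 'site' holds the instruction bytes as the compiler emitted them: a
 * 2-byte string instruction followed by nop padding. If the CPU lacks
 * FSRM, overwrite the site with a 5-byte "call rel32" and refill the
 * rest with nops. Returns the number of bytes rewritten as the call,
 * or 0 if the site was left alone.
 *
 * Assumption: the host is little-endian (as x86 is), so copying the
 * int32_t directly yields the displacement bytes in instruction order.
 */
static size_t patch_site(uint8_t *site, size_t site_len,
			 int cpu_has_fsrm, int32_t rel32)
{
	/* The padding exists precisely so the longer encoding fits. */
	assert(site_len >= CALL_LEN);

	if (cpu_has_fsrm)
		return 0;	/* keep the inline string instruction */

	site[0] = OP_CALL_REL32;
	memcpy(&site[1], &rel32, sizeof(rel32));
	memset(site + CALL_LEN, OP_NOP, site_len - CALL_LEN);
	return CALL_LEN;
}
```

Feeding it a five-byte site of { 0xF3, 0xA4, 0x90, 0x90, 0x90 } with
cpu_has_fsrm set to 0 leaves an E8 call in place of the string
instruction, which is why a disassembly of the unpatched object file
doesn't tell you what a non-FSRM machine actually runs.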

And yes, the '__x86_return_thunk' overhead can be pretty horrific. It
will get rewritten to the appropriate thing by "apply_returns". But
like the "rep movs" and the missing STAC/CLAC, you won't see that in
the objdump; you only see it in the final, runtime-patched code.

Linus