RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: David Laight
Date: Thu Nov 16 2023 - 17:53:47 EST


From: Linus Torvalds
> Sent: 16 November 2023 17:25
...
> > How much difference does FSRM actually make?
> > Especially when compared to the cost of a function call (even
> > without the horrid return thunk).
>
> It can be a big deal. The subject line here is an example. On that
> machine, using the call to 'memcpy_orig' clearly performs *noticeably*
> better. So that 16% regression was apparently at least partly
> because of
>
> -11.0 perf-profile.self.cycles-pp.memcpy_orig
> +14.7 perf-profile.self.cycles-pp.copy_page_from_iter_atomic
>
> where that inlined copy (that used 'rep movsq' and other things around
> it) was noticeably worse than just calling memcpy_orig that does a
> basic unrolled loop.

Wasn't that the stupid PoS inlined memcpy that was absolutely
horrendous?
I've also not seen any obvious statement about the lengths of the
copies.

> Now, *why* it matters a lot is unclear. Some machines literally have
> the "fast rep string" code disabled, and then "rep movsb" is just
> horrendous. That's arguably a machine setup issue, but people have
> been known to do those things because of problems (most recently
> "reptar").

They get what they deserve :-)

I've just done some measurements on an i7-7700.
cpuinfo:flags has erms but not fsrm (as I'd expect).

The test code path is:
        rdpmc
        lfence
then 10 copies of:
        mov     %r13,%rdi
        mov     %r14,%rsi
        mov     %r15,%rcx
        rep movsb
followed by:
        lfence
        rdpmc

which I run through 5 times.
The first pass is cold-cache and expected to be slow.
The other 4 pretty much take the same number of clocks.
(Which is what I've seen before using the same basic program
to time the ip-checksum code.)
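
For reference, the harness is roughly this shape in user-space C
(a minimal sketch, not the actual program - it times with rdtsc
instead of rdpmc so it runs without enabling user-mode access to
the performance counters):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static char src[4096], dst[4096];

static inline uint64_t cycles(void)
{
        uint32_t lo, hi;

        /* lfence stops the timestamp read moving past the copies */
        asm volatile("lfence\n\trdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

static inline void rep_movsb(void *d, const void *s, size_t n)
{
        asm volatile("rep movsb"
                     : "+D" (d), "+S" (s), "+c" (n)
                     : : "memory");
}

int main(void)
{
        memset(src, 1, sizeof(src));
        for (int pass = 0; pass < 5; pass++) {
                uint64_t t0 = cycles();

                for (int i = 0; i < 10; i++)     /* 10 copies, as above */
                        rep_movsb(dst, src, 32); /* vary the length */

                uint64_t t1 = cycles();
                printf("pass %d: %llu clocks\n", pass,
                       (unsigned long long)(t1 - t0));
        }
        return 0;
}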

At first sight it appears that each 'rep movsb' takes about
32 clocks for short copies and only starts increasing above
(about) 32 bytes - and then increases very slowly.

But something very odd is going on.
For lengths 1 to ~32 the first pass is ~4500 clocks and the others ~320.
For longer lengths the clock count increases slowly.
But length 0 reports ~600 clocks for all 5 passes.

The cache should be the same in both cases.
So the effect must be an artifact of the instruction decoder.
The loop is short enough to fit in the cpu loop buffer
(of decoded u-ops) - I could try padding it with lots of nops.

This rather implies that the first decode of 'rep movs' takes
something horrid like 450 clocks, but the result gets saved
somewhere. OTOH if the count is zero the decode+execute is only
~60 clocks, but it isn't saved.

If that is true (and I doubt Intel would admit it) you pretty
much never want to use 'rep movs' in any form unless you are
going to execute the instruction multiple times or the
length is significant.

This wasn't the conclusion I expected to come to...

It also means that while 'rep movs' will copy at 16 bytes/clock
(or 32 if the destination is aligned) it is possible that it
will always be slower than a register copy loop (8 bytes/clock)
unless the copy is significantly longer than most kernel
memcpy() calls ever are.
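
By 'register copy loop' I mean something of this shape (a C sketch
of the sort of unrolled loop memcpy_orig uses, not the actual code):

#include <stdint.h>
#include <stddef.h>

static void copy_regs(void *dst, const void *src, size_t len)
{
        uint64_t *d = dst;
        const uint64_t *s = src;

        /* 4 independent 8-byte load/store pairs per iteration */
        while (len >= 32) {
                uint64_t a = s[0], b = s[1], c = s[2], e = s[3];

                d[0] = a; d[1] = b; d[2] = c; d[3] = e;
                s += 4;
                d += 4;
                len -= 32;
        }
        /* tail handling (< 32 bytes) omitted for brevity */
}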

...
> I have this memory from my "push for -Os" (which is from over a decade
> ago, so take my memory with a pinch of salt) of seeing "rep movsb"
> followed by a load of the result causing a horrid stall on the load.

I added some (unrelated) memory accesses between the 'rep movsb'.
Didn't see any significant delays.

The systems you were using a decade ago were likely very different
to the current ones - especially if they were Intel and pre-dated
Sandy Bridge.

> A regular load-store loop will have the store data forwarded to any
> subsequent load, but "rep movs" might not do that and if it works on a
> cacheline level you might lose out on those kinds of things.

That probably doesn't matter for data buffer copies.
You are unlikely to access them again that quickly.
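
The pattern where forwarding would matter is a short copy whose
result is consumed immediately, e.g. (illustrative sketch only):

#include <stdint.h>
#include <string.h>

struct header { uint32_t len; uint32_t type; };

static int parse(const void *pkt, uint32_t max_len)
{
        struct header h;

        /* A load/store loop forwards h.len from the store buffer;
         * if 'rep movs' works at cacheline granularity the load
         * below may have to wait for the whole line instead. */
        memcpy(&h, pkt, sizeof(h));
        if (h.len > max_len)
                return -1;
        return (int)h.type;
}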

> Don't get me wrong - I really like the rep string instructions, and
> while they have issues I'd *love* for CPU's to basically do "memcpy"
> and "memset" without any library call overhead. The security
> mitigations have made indirect calls much worse, but they have made
> regular function call overhead worse too (and there's the I$ footprint
> thing etc etc).
>
> So I like "rep movs" a lot when it works well, but it most definitely
> does not work well everywhere.

Yes, it is a real shame that everything since (probably) the 486
has executed 'rep anything' rather slower than you might expect.

Intel also f*cked up the 'loop' (dec %rcx; jnz) instruction.
Even on cpus with adcx and adox - where a branch that preserves
the flags is exactly what you need - 'loop' is too slow to use.
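
What you write instead looks like this (a sketch of a plain adc
carry chain; the loop control must not clobber CF, and with
adcx/adox dual chains even the 'dec' below is out, since dec
clobbers OF, the adox carry):

#include <stdint.h>
#include <stddef.h>

/* Add the 64-bit words of src[] into dst[]; assumes n > 0 and
 * drops the final carry for brevity. */
static void add_words(uint64_t *dst, const uint64_t *src, size_t n)
{
        asm volatile("xor    %%eax, %%eax\n"      /* clears CF (and %eax) */
                "1:      mov    (%[s]), %%rax\n"
                "        adc    %%rax, (%[d])\n"
                "        lea    8(%[s]), %[s]\n"  /* lea writes no flags */
                "        lea    8(%[d]), %[d]\n"
                "        dec    %[n]\n"           /* dec preserves CF */
                "        jnz    1b"
                : [d] "+r" (dst), [s] "+r" (src), [n] "+r" (n)
                : : "rax", "memory", "cc");
}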

...
> The problem with code generation at this level is that you win some,
> you lose some. You can seldom make everybody happy.

Trying to second-guess a workable model for the x86 cpu is hard.
For arithmetic instructions the register dependency chains seem
to give a reasonable model.
If the code flow doesn't depend on the data then the out-of-order
execution engine will process the data (from cache) when the
relevant memory instructions finally complete.
So I actually got pretty much the expected timings for my ip-csum
code loops (somewhat better than the current version).
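
As a (hypothetical) illustration of that model: split a sum across
two accumulators and the dependency chains overlap, so the loop
runs at add throughput rather than add latency:

#include <stdint.h>
#include <stddef.h>

/* Serial version: each add waits for the previous one. */
static uint64_t sum1(const uint64_t *p, size_t n)
{
        uint64_t s = 0;

        for (size_t i = 0; i < n; i++)
                s += p[i];
        return s;
}

/* Two independent chains: the out-of-order core overlaps them.
 * Assumes n is even. */
static uint64_t sum2(const uint64_t *p, size_t n)
{
        uint64_t s0 = 0, s1 = 0;

        for (size_t i = 0; i < n; i += 2) {
                s0 += p[i];
                s1 += p[i + 1];
        }
        return s0 + s1;
}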

But give me a nice simple cpu like the NiosII soft cpu.
The instruction and local memory timings are absolutely
well defined - and you can look at the fpga internals and
check!

David
