RE: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: David Laight
Date: Fri Nov 17 2023 - 10:20:33 EST


From: Linus Torvalds
> Sent: 17 November 2023 13:36
>
> On Fri, 17 Nov 2023 at 08:09, David Laight <David.Laight@xxxxxxxxxx> wrote:
> >
> > Zero length copies are different, they always take ~60 clocks.
>
> That zero-length thing is some odd microcode implementation issue, and
> I think intel actually made a FZRM cpuid bit available for it ("Fast
> Zero-size Rep Movs").
>
> I don't think we care in the kernel, but somebody else did (or maybe
> Intel added a flag for "we fixed it" just because they noticed)

I wasn't really worried about it - but it was an oddity.

> I at some point did some profiling, and we do have zero-length memcpy
> cases occasionally (at least for user copies, which was what I was
> looking at), but they aren't common enough to worry about some small
> extra strange overhead.

For user copies, avoiding the stac/clac might make it worthwhile.
But I doubt you'd want to add the 'jcxz .+n' in the copy code
itself because the mispredicted branch might make a bigger
difference.

I have tested writev() with lots of zero-length fragments.
But that isn't a normal case.

> (In case you care, it was for things like an ioctl doing "copy the
> base part of the ioctl data, then copy the rest separately". Where
> "the rest" was then often nothing at all).

That specific code, where a zero-length copy is quite likely,
would probably benefit from a test in the source.

> > My current guess for the 5000 clocks is that the logic to
> > decode 'rep movsb' is loaded into a buffer that is also used
> > to decode some other instructions.
>
> Unlikely.
>
> I would guess it's the "power up the AVX2 side". The memory copy uses
> those same resources internally.

That would make more sense - and have much the same effect.
If the kernel used 'rep movsb' internally and for user copies,
that unit pretty much wouldn't ever get powered down.

> You could try to see if "first AVX memory access" (or similar) has the
> same extra initial cpu cycle issue.

Spot on.
vpbroadcastd %xmm1,%xmm2
does the trick as well.

> Anyway, the CPU you are testing is new enough to have ERMS - that's
> the "we do pretty well on string instructions" flag. It does indeed do
> pretty well on string instructions, but has a few oddities in addition
> to the zero-sized thing.

From what I looked at, pretty much everything anyone cares about
probably has ERMS.
You need to be running on something older than Ivy Bridge
(where the ERMS flag first appeared) - so basically Sandy Bridge,
Core 2, or a Netburst P4.
The AMD CPUs are similarly old.

> The other bad cases tend to be along the line of "it falls flat on its
> face when the source and destination address are not mutually aligned,
> but they are the same virtual address modulo 4096".

There is a similar condition that very often stops the CPU
from ever actually doing two memory reads in one clock.
Could easily be related.

> Or something like that. I forget the exact details. The details do
> exist, but I forget where (I suspect either Agner Fog or some footnote
> in some Intel architecture manual).

If Intel have published it, it will be in an unlit basement
behind a locked door and a broken staircase!

Unless 'page copy' hits it, I wonder if it really matters
for a normal workload.
Yes, you can conspire to hit it, but mostly you won't.

Wasn't it one of the Atoms where the data cache prefetch
managed to completely destroy forward data copies?
To the point where it was worth taking the hit of a
backwards copy?

> So it's very much not as simple as "fixed initial cost and then a
> fairly fixed cost per 32B", even if that is *one* pattern.

True, but it is the most common one.
And if it is bad the whole thing isn't worth using at all.

I'll try my test on an Ivy Bridge later.
(I don't have anything older that actually boots.)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)