Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: Linus Torvalds
Date: Thu Nov 16 2023 - 12:25:23 EST


On Thu, 16 Nov 2023 at 11:55, David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> I presume lack of coffee is responsible for the s/movs/stos/ :-)

Yes.

> How much difference does FSRM actually make?
> Especially when compared to the cost of a function call (even
> without the horrid return thunk).

It can be a big deal. The subject line here is an example. On that
machine, using the call to 'memcpy_orig' clearly performs *noticeably*
better. So that 16% regression was apparently at least partly because
of

-11.0 perf-profile.self.cycles-pp.memcpy_orig
+14.7 perf-profile.self.cycles-pp.copy_page_from_iter_atomic

where that inlined copy (that used 'rep movsq' and other things around
it) was noticeably worse than just calling memcpy_orig that does a
basic unrolled loop.
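
To make the comparison concrete, the two code shapes at issue look
roughly like this. This is an illustrative sketch only - not the actual
iov_iter or memcpy_orig code, and the tail handling is omitted:

    #include <stddef.h>

    /* (a) inlined copy using the rep string instruction
     *     (assumes n is a multiple of 8) */
    static inline void copy_rep_movsq(void *to, const void *from, size_t n)
    {
            size_t qwords = n / 8;

            asm volatile("rep movsq"
                         : "+D" (to), "+S" (from), "+c" (qwords)
                         : : "memory");
    }

    /* (b) plain unrolled load/store loop, in the spirit of memcpy_orig
     *     (assumes n is a multiple of 32; tail handling omitted) */
    static inline void copy_unrolled(void *to, const void *from, size_t n)
    {
            unsigned long *d = to;
            const unsigned long *s = from;

            for (; n >= 32; n -= 32, d += 4, s += 4) {
                    d[0] = s[0];
                    d[1] = s[1];
                    d[2] = s[2];
                    d[3] = s[3];
            }
    }

The first one is smaller and has no call overhead; the second one is
just ordinary loads and stores that the OoO machinery can schedule and
forward like any other code.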

Now, *why* it matters a lot is unclear. Some machines literally have
the "fast rep string" code disabled, and then "rep movsb" is just
horrendous. That's arguably a machine setup issue, but people have
been known to do those things because of problems (most recently
"reptar").

And in most older microarchitectures it's not just the cycles in the
rep string instruction itself, it is also a pipeline stall and I think it's also a
(partial? full?) barrier for OoO execution. That pipeline stall was
most noticeable on P4, but it's most definitely there on other cores
too.

And the OoO execution barrier can mean that it *benchmarks* fairly well
when you just do "rep movs" in a loop to test, but then if you have
code *around* it, it causes problems for the instructions around it.

I have this memory from my "push for -Os" (which is from over a decade
ago, so take my memory with a pinch of salt) of seeing "rep movsb"
followed by a load of the result causing a horrid stall on the load.

A regular load-store loop will have the store data forwarded to any
subsequent load, but "rep movs" might not do that, and if it works at
the cacheline level you might lose out on that kind of forwarding.
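
As a sketch of the pattern I mean (hypothetical struct and field names,
reusing the helpers from the sketch above):

    struct reply {
            unsigned long status;
            unsigned long data[7];
    };

    /* Hypothetical example: copy a small buffer, then immediately use it. */
    unsigned long handle(struct reply *dst, const struct reply *src)
    {
            /* With the plain loop, the stores can forward to the load below. */
            copy_unrolled(dst, src, sizeof(*dst));

            /*
             * With copy_rep_movsq() instead, the microcoded copy may not
             * forward its stores, so this load can stall until the
             * destination cache line has actually been written out.
             */
            return dst->status;
    }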

Don't get me wrong - I really like the rep string instructions, and
while they have issues I'd *love* for CPU's to basically do "memcpy"
and "memset" without any library call overhead. The security
mitigations have made indirect calls much worse, but they have made
regular function call overhead worse too (and there's the I$ footprint
thing etc etc).

So I like "rep movs" a lot when it works well, but it most definitely
does not work well everywhere.
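
Which is why the choice isn't made once and for all: conceptually the
copy gets dispatched on what the CPU can do, something like the sketch
below. The real kernel patches this decision into the code at boot with
ALTERNATIVE() rather than taking a runtime branch, and cpu_has_fsrm
here is a made-up flag, so treat this as a conceptual sketch only:

    #include <stdbool.h>
    #include <stddef.h>

    extern bool cpu_has_fsrm;   /* hypothetical: set from CPUID at init */

    static inline void copy_rep_movsb(void *to, const void *from, size_t n)
    {
            asm volatile("rep movsb"
                         : "+D" (to), "+S" (from), "+c" (n)
                         : : "memory");
    }

    static inline void copy_best_effort(void *to, const void *from, size_t n)
    {
            if (cpu_has_fsrm)
                    copy_rep_movsb(to, from, n);   /* FSRM: byte copy is fast */
            else
                    copy_unrolled(to, from, n);    /* otherwise, plain loop   */
    }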

Of course, while the kernel test robot doesn't seem to like the
inlined "rep movsq", clearly the machine David is on absolutely
*hates* the call to memcpy_orig. Possibly due to mitigation overhead.

The problem with code generation at this level is that you win some,
you lose some. You can seldom make everybody happy.

Linus