Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

From: Linus Torvalds
Date: Fri Nov 17 2023 - 11:32:55 EST

Next message: Andrew Davis: "[PATCH 1/2] arm64: dts: ti: k3-am65: Enable SDHCI nodes at the board level"
Previous message: Alexander Gordeev: "Re: [PATCH v3 0/3] s390/vfio-ap: a couple of corrections to the IRQ enablement function"
In reply to: Borislav Petkov: "Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression"
Next in thread: Linus Torvalds: "Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression"

On Fri, 17 Nov 2023 at 11:10, Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> Which looks like those added memcpy calls add a lot of overhead due to
> the mitigations crap.

No, you missed where I thought that too and asked David to test
without mitigations.

That load really loves "rep movsb" on his machine (and that includes
the gcc-generated odd inlined "one word by hand, and then 'rep movsq'
for the rest").

It's probably because it's a benchmark that doesn't actually touch the
data, and does page-sized copies. It's pretty much the optimal case
for ERMS.

The "do one word by hand, the rest with 'rep movsq'" model that gcc
uses (but only in this particular code generation case) probably ends
up being quite reasonable in general - the one word by hand allows for
unaligned counts, but it also brings in the beginning of the copy into
the cache (which is *often* the part used later - not in this
benchmark, but in general), and then the rest ends up being done as L2
cacheline copies at least when we have those nice page-aligned
patterns.
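
For anyone who wants to poke at that pattern outside the kernel, here is a
rough user-space sketch of the "one word by hand, then 'rep movsq'" idea.
It is not gcc's literal output and not kernel code: the function name
copy_word_then_qwords() is made up for illustration, it assumes the two
buffers do not overlap, and it falls back to a plain loop where the
x86-64 inline asm is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: hypothetical helper, not kernel code and not
 * gcc's literal code generation. Assumes dst and src do not overlap.
 */
static void copy_word_then_qwords(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t skip, qwords;

	assert(dst != NULL && src != NULL);

	if (len < 8) {
		/* Too short for the word trick: plain byte copy. */
		while (len--)
			*d++ = *s++;
		return;
	}

	/* One (possibly unaligned) word by hand; touches the start of the buffer. */
	memcpy(d, s, 8);

	/*
	 * Skip the odd bytes so the remainder is a whole number of qwords.
	 * The skipped bytes (at most 7) were already covered by the word above.
	 */
	skip = len & 7;
	d += skip;
	s += skip;
	qwords = (len - skip) / 8;

#if defined(__GNUC__) && defined(__x86_64__)
	/* RDI = destination, RSI = source, RCX = qword count. */
	__asm__ volatile("rep movsq"
			 : "+D" (d), "+S" (s), "+c" (qwords)
			 :
			 : "memory");
#else
	/* Portable stand-in for "rep movsq". */
	while (qwords--) {
		memcpy(d, s, 8);
		d += 8;
		s += 8;
	}
#endif
}
```

When the length is already a multiple of 8 the first word simply gets
copied twice, which is harmless for non-overlapping buffers and keeps the
sketch branch-free after the short-length check.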

Of course, this whole thread started because the kernel test robot
then has exactly the opposite reaction - it seems to really *hate*
that inlined code generation by gcc. Whether that is because it's a
very different microarchitecture, or it's because it's just a very
different access pattern than the one that David's random KUnit test
pattern is, I don't know.

That kernel test robot case is on a Cooper Lake Xeon, which (I think)
is just Skylake server. Random Intel codenames...

So that test robot has ERMS too, but no FSRM, so we do the old
"memcpy_orig()" with the regular memcpy loop.
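
To check which camp a given machine falls into, something like the
following user-space sketch should do. It reads the ERMS and FSRM bits
from CPUID leaf 7 (EBX bit 9 and EDX bit 4 per the Intel SDM) and maps
them to the two paths described above. The struct and function names are
made up for illustration, and the kernel's own feature flags can differ
from raw CPUID (for example if fast strings are disabled), so treat the
answer as a hint rather than what the kernel actually patched in.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

/* Hypothetical container for the two "rep movs" feature bits. */
struct rep_features {
	bool erms;	/* Enhanced REP MOVSB/STOSB */
	bool fsrm;	/* Fast Short REP MOVSB */
};

/* Read the bits from CPUID leaf 7, subleaf 0; all false on non-x86. */
static struct rep_features detect_rep_features(void)
{
	struct rep_features f = { false, false };
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
		f.erms = (ebx >> 9) & 1;
		f.fsrm = (edx >> 4) & 1;
	}
#endif
	return f;
}

/*
 * Mirrors the description in this mail: with FSRM the out-of-line
 * memcpy is "rep movsb", without it we end up in memcpy_orig().
 */
static const char *memcpy_path(struct rep_features f)
{
	return f.fsrm ? "rep movsb" : "memcpy_orig";
}
```

Calling memcpy_path(detect_rep_features()) on an ERMS-only box like the
test robot's should give "memcpy_orig", assuming CPUID is not being
masked by a hypervisor.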

And on that Xeon, it really does seem to be the right thing to do.

But the profile is so noisy with other changes that it's not like I
can guarantee that that is the main issue here. The reason I zeroed in
on the memcpy thing was really just that (a) it does show up in the
profiles and (b) the commit that introduced that 16% regression
doesn't really seem to do anything else than reorganize things just
enough that gcc seems to do that alternate memcpy implementation.

The test case seems to be (from the profile) just a simple

    do_iter_readv_writev ->
      shmem_file_write_iter ->
        generic_perform_write ->
          copy_page_from_iter_atomic ->
            memcpy_from_iter_mc

and that's then where the new code generation matters (ie does it do
that "inline with rep movsq" or "call memcpy_orig").

For David, the rep movsq is great. For the kernel test robot, it's bad.

Linus
