Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

From: Linus Torvalds
Date: Thu May 04 2023 - 14:56:37 EST


On Thu, May 4, 2023 at 10:39 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> And I suspect that the 'blogbench.read_score' thing probably does
> mostly big reads, and really really liked the cacheline optimizations
> of 'rep movs' even when it was slow in other things.

Oh, and I went to look at the blogbench code, and it's a
*horrifically* bad benchmark for this case.

It may actually be a good benchmark for other things - the "look up
random filenames" etc might be somewhat indicative of real performance
on some loads.

But the read() system call part is very much *not* good.

It looks like the reason blogbench likes 'rep movs' is the traditional
reason why memory copy benchmarks tend to like uncached copies: the
benchmark doesn't actually even look at the data it reads.

It's a common pattern, but it's a bad pattern. It's not very different
from doing CPU benchmarks using an empty loop:

for (int i = 0; i < 1000000; i++) /*nothing*/;

and then being surprised when the compiler optimizes the loop away
because it doesn't actually *do* anything.

Of course, unlike the empty loop, the kernel will still do the reads,
but the kernel will do the reads with the assumption that the result
matters.

And for blogbench, that is *doubly* not the case.

Because it looks like blogbench does a loop over

read_dummy_file(file_name);

which will open that file, and then loop over

#define READ_CHUNK_SIZE 65536
...
static char garbage[READ_CHUNK_SIZE];
...
loop:
if ((readen = read(fd, garbage, sizeof garbage))

and that variable is very appropriately named indeed. It is very much garbage.
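
Spelled out, the read path amounts to roughly this (a paraphrase, not
the literal blogbench source - the exact loop condition and error
handling are approximated):

    #include <fcntl.h>
    #include <unistd.h>

    #define READ_CHUNK_SIZE 65536

    /* one static buffer, shared by every reader thread in the process */
    static char garbage[READ_CHUNK_SIZE];

    static int read_dummy_file(const char * const file_name)
    {
            ssize_t readen;
            int fd;

            if ((fd = open(file_name, O_RDONLY)) == -1)
                    return -1;

            /* pull the whole file into 'garbage' - and never look at it again */
            do {
                    readen = read(fd, garbage, sizeof garbage);
            } while (readen > 0);

            close(fd);
            return 0;
    }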

So it really looks like what is going on is that you have

(a) a threaded application (looks like N=100 for reads by default)

(b) doing 'read()' calls into a *shared* buffer in parallel

(c) and never using the buffer for anything else

which means that a memory copy that does non-temporal writes is going
to look artificially really good, because nobody wants to ever see the
end result.
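
Put together, the structure boils down to something like this (a
distilled illustration, not the actual blogbench code - the file name
is made up, and the real benchmark runs for a fixed time rather than
forever):

    #include <fcntl.h>
    #include <pthread.h>
    #include <unistd.h>

    #define N_READERS 100                  /* the default read concurrency */
    #define READ_CHUNK_SIZE 65536

    static char garbage[READ_CHUNK_SIZE];  /* the one shared sink for everybody */

    static void *reader_thread(void *arg)
    {
            const char *file_name = arg;   /* made-up name, for illustration */

            for (;;) {
                    int fd = open(file_name, O_RDONLY);

                    if (fd == -1)
                            break;
                    /* every thread dirties the same cache lines here,
                       and nobody ever reads the result back */
                    while (read(fd, garbage, sizeof garbage) > 0)
                            ;
                    close(fd);
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t tid[N_READERS];
            int i;

            for (i = 0; i < N_READERS; i++)
                    pthread_create(&tid[i], NULL, reader_thread, "blog-0/article-0");
            for (i = 0; i < N_READERS; i++)
                    pthread_join(tid[i], NULL);
            return 0;
    }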

Now, the "nobody wants to ever see the end result", might actually be
at least _partially_ valid in some historical setting, if you were to
use DMA to then push things out to a network. Back in the olden days,
that was actually a valid reason to use non-temporal writes, because
the DMA would have to invalidate the caches anyway.

So that part is wrong, but isn't necessarily *horrifically* wrong.
It's just bad.

But using one shared destination buffer for N readers is *completely*
bogus. The cache lines would bounce around like crazy due to the
(pointless) sharing. I'd like to call it "false sharing", but it's not
really false: it's intentionally just using one big shared buffer as a
sink for all these read() calls.

End result: I think the benchmark scores for reading are entirely
pointless random numbers, and trying to optimize the kernel for this
benchmark is exactly the wrong thing to do.

The benchmark doesn't actually do anything half-way valid.

At a *minimum* that benchmark should have different threads using
different read() buffers.

I suspect it might be a good idea to also touch some of the data,
because that would be the normal situation (even with old-style
zero-copy DMA you'd probably not take raw file data and push it out
to the network unmodified).
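
Something along these lines would cover both points (only a sketch of
the idea, not a patch against the actual blogbench source - the
per-thread stack buffer and the trivial checksum are just the simplest
way to make the destination private and make the data matter):

    #include <fcntl.h>
    #include <unistd.h>

    #define READ_CHUNK_SIZE 65536

    static int read_dummy_file(const char * const file_name)
    {
            char buf[READ_CHUNK_SIZE];      /* per-thread, on the stack */
            unsigned long sum = 0;
            ssize_t readen, i;
            int fd;

            if ((fd = open(file_name, O_RDONLY)) == -1)
                    return -1;

            while ((readen = read(fd, buf, sizeof buf)) > 0) {
                    /* actually look at the data, like a real consumer would */
                    for (i = 0; i < readen; i++)
                            sum += (unsigned char) buf[i];
            }

            close(fd);

            /* fold the checksum into the (non-negative) return value so
               the touching can't just be thrown away; -1 still means error */
            return (int) (sum & 0xffff);
    }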

Because otherwise you will always find that objectively bad memory
copies do better than a good memory copy that makes sense.

Anyway, that was a very long-winded way of saying that I will not use
that benchmark as a reason to touch the kernel "copy_to_user()"
implementation.

Of course, that's not to say that this might not be a real regression
on real loads, but I'd want to see those other numbers.

It looks like this is an old benchmark that hasn't been touched in
years (just going by the github activity), but I'm cc'ing the author
anyway, and will add a pointer to this email as a github issue. Maybe
the developer cares, maybe he doesn't, but there's no reason not to at
least mention this and see if the benchmark can be improved to use
per-thread buffers for the reads.

(The same issues are true for the writing side too, of course, but the
writing side has less parallelism, and a shared *source* is not the
same kind of problem from a cache pattern standpoint, so it probably
isn't going to see anything like the same effect from any memcpy
implementation).

Linus