Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression

From: Feng Tang
Date: Tue May 09 2023 - 08:30:52 EST


Hi Linus,

On Thu, May 04, 2023 at 11:56:12AM -0700, Linus Torvalds wrote:
> On Thu, May 4, 2023 at 10:39 AM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > And I suspect that the 'blogbench.read_score' thing probably does
> > mostly big reads, and really really liked the cacheline optimizations
> > of 'rep movs' even when it was slow in other things.
>
> Oh, and I went to look at the blogbench code, and it's a
> *horrifically* bad benchmark for this case.
>
> It may actually be a good benchmark for other things - the "look up
> random filenames" etc might be somewhat indicative of real performance
> on some loads.
>
> But the read() system call part is very much *not* good.
>
> It looks like the reason blogbench likes 'rep movs' is the traditional
> reason why memory copy benchmarks tend to like uncached copies: it
> looks like the benchmark doesn't actually even look at the data it
> reads.
>
> It's a common pattern, but it's a bad pattern. It's not very different
> from doing CPU benchmarks using an empty loop:
>
> for (int i = 0; i < 1000000; i++) /*nothing*/;
>
> and then being surprised when the compiler optimized the loop away
> because it doesn't actually *do* anything.
>
> Of course, unlike the empty loop, the kernel will still do the reads,
> but the kernel will do the reads with the assumption that the result
> matters.
>
> And for blogbench, that is *doubly* not the case.
>
> Because it looks like blogbench does a loop over
>
> read_dummy_file(file_name);
>
> which will open that file, and then loop over
>
> #define READ_CHUNK_SIZE 65536
> ...
> static char garbage[READ_CHUNK_SIZE];
> ..
> loop:
> if ((readen = read(fd, garbage, sizeof garbage))
>
> and that variable is very appropriately named indeed. It is very much garbage.
>
> So it really looks like what is going on is that you have
>
> (a) a threaded application (looks like N=100 for reads by default)
>
> (b) doing 'read()' calls into a *shared* buffer in parallel
>
> (c) and never using the buffer for anything else
>
> which means that a memory copy that does non-temporal writes is going
> to look artificially really good, because nobody wants to ever see the
> end result.
>
> Now, the "nobody wants to ever see the end result", might actually be
> at least _partially_ valid in some historical setting, if you were to
> use DMA to then push things out to a network. Back in the olden days,
> that was actually a valid reason to use non-temporal writes, because
> the DMA would have to invalidate the caches anyway.
>
> So that part is wrong, but isn't necessarily *horrifically* wrong.
> It's just bad.
>
> But using one shared destination buffer for N readers is *completely*
> bogus. Any cache-lines would bounce around like crazy for the
> (pointless) sharing. I'd like to call it "false sharing", but it's not
> really false: it's intentionally just using one big shared buffer as a
> sink for all these read() calls.
>
> End result: I think the benchmark scores for reading are entirely
> pointless random numbers, and trying to optimize the kernel for this
> benchmark is exactly the wrong thing to do.
>
> The benchmark doesn't actually do anything half-way valid.
>
> At a *minimum* that benchmark should have different threads using
> different read() buffers.

We tried a debug patch which allocates a dedicated buffer for
each reader thread, run it on the same Cacade Lake platform, and
the regression is _gone_, after the noise of cache false sharing is
reduced.

20f3337d350c4e1b adfcf4231b8cbc2d9c1e7bfaa96
---------------- ---------------------------
2119572 ± 2% +1.7% 2155704 ± 2% blogbench.read_score

3700996 ± 9% -7.4% 3427562 ± 16% perf-stat.ps.dTLB-load-misses
8.171e+09 +25.3% 1.024e+10 perf-stat.ps.dTLB-loads
431224 ± 9% -13.1% 374903 ± 12% perf-stat.ps.dTLB-store-misses
2.399e+09 +87.5% 4.497e+09 perf-stat.ps.dTLB-stores
8495507 ± 7% -8.2% 7794738 perf-stat.ps.iTLB-load-misses

3.07 -3.1 0.00 perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
0.00 +3.4 3.40 perf-profile.self.cycles-pp.copy_user_generic_unrolled

We also run the original blogbench binary on a IceLake platform
(which has both ERMS and FSRM), and there is almost no peformance
change for the 2 commits.

Please let us know if you need more profiling info or want to run
more tests.

Thanks,
Feng

---
The debug patch for blobbench is:

diff --git a/src/blogbench.h b/src/blogbench.h
index bf538a8..d99d5bb 100644
--- a/src/blogbench.h
+++ b/src/blogbench.h
@@ -178,6 +178,8 @@ void *rewriter(void * const fodder);
void *reader(void * const fodder);
void *commenter(void * const fodder);

+extern pthread_key_t read_buf_key;
+
#include "globals.h"

#endif
diff --git a/src/helpers.c b/src/helpers.c
index 4c17e73..85631ec 100644
--- a/src/helpers.c
+++ b/src/helpers.c
@@ -107,15 +107,18 @@ int create_atomic_file(const char * const file_name,

int read_dummy_file(const char * const file_name)
{
- static char garbage[READ_CHUNK_SIZE];
ssize_t readen;
+ void * read_buf;
int fd;

if ((fd = open(file_name, O_RDONLY)) == -1) {
return -1;
}
+
+ read_buf = pthread_getspecific(read_buf_key);
+
do {
- if ((readen = read(fd, garbage, sizeof garbage))
+ if ((readen = read(fd, read_buf, READ_CHUNK_SIZE))
< (ssize_t) 0) {
if (errno == EINTR) {
continue;
diff --git a/src/process.c b/src/process.c
index a83a980..04f1411 100644
--- a/src/process.c
+++ b/src/process.c
@@ -58,9 +58,13 @@ static int wait_rewriters(void)
return 0;
}

+pthread_key_t read_buf_key = 0;
+
static int spawn_readers(void)
{
unsigned int t = nb_readers;
+
+ pthread_key_create(&read_buf_key, NULL);

do {
t--;
diff --git a/src/reader.c b/src/reader.c
index 91bab7a..2c0efdb 100644
--- a/src/reader.c
+++ b/src/reader.c
@@ -66,6 +66,11 @@ static int read_random_blog(void)
void *reader(void * const fodder)
{
(void) fodder;
+
+ void *read_buf;
+
+ read_buf = malloc(READ_CHUNK_SIZE);
+ pthread_setspecific(read_buf_key, read_buf);

do {
if (read_random_blog() != 0) {
@@ -75,6 +80,8 @@ void *reader(void * const fodder)
usleep(USLEEP_READER);
#endif
} while (stop == (sig_atomic_t) 0);
+
+ free(read_buf);

return NULL;
}

> I suspect it might be a good idea to also touch some of the data,
> because that would be the normal situation (even with old-style
> zero-copy DMA you'd probably not do raw file data and push it out to
> the network unmodified).
>
> Because otherwise you will always find that objectively bad memory
> copies will do better than a good memory copy that makes sense.
>
> Anyway, that was a very long-winded way of saying that I will not use
> that benchmark as a reason to touch the kernel "copy_to_user()"
> implementation.
>
> Of course, that's not to say that this might not be a real regression
> on real loads, but I'd want to see those other numbers.
>
> It looks like this is an old benchmark that hasn't been touched in
> years (just going by the github activity), but I'm cc'ing the author
> anyway, and will add a pointer to this email as a github issue. Maybe
> the developer cares, maybe he doesn't, but no reason not to at least
> mention this and see if maybe the benchmark can be improved to at
> least use a thread buffers for the reads.
>
> (The same issues are true for the writing side too, of course, but the
> writing side has less parallelism and a shared *source* is not the
> same kind of problematic from a cache pattern standpoint, so probably
> isn't going to have anything like the same effect from any memcpy
> implementation).
>
> Linus