RE: [PATCH] fs.h: Optimize file struct to prevent false sharing

From: Chen, Zhiyin
Date: Thu Jun 01 2023 - 06:48:08 EST


Good questions.
perf was used to analyze the performance. In the syscall test, the patch reduces
the CPU cycles spent in filp_close, and the HITM count drops from 43182 to 33146.
The test is not restricted to a set of adjacent cores; the numactl command is only
used to limit the number of CPU cores in use. In most situations, only 8/16/32
CPU cores are used. The performance improvement is still obvious even when
non-adjacent CPU cores are used.
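
For reference, HITM counts like those above are typically collected with perf c2c.
The invocation below is an illustrative sketch only, not necessarily the exact
command line used for these numbers:

Command: perf c2c record -a -- ./Run -c 16 syscall
Command: perf c2c report --stats

perf c2c report also breaks the HITM samples down per contended cache line and per
offset within the line, which is how the affected members of struct file can be
identified.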

Regardless of CPU type, cache size, or architecture, false sharing is always
detrimental to performance, so the read-mostly members should be grouped together,
as the sketch below illustrates.
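
Here is a minimal userspace sketch of the effect. It is illustrative only: the
structs and field names are made up and are not the struct file layout from the
patch, which simply reorders the existing members so that the read-mostly ones
share cache lines.

/*
 * false_sharing.c - minimal false-sharing demo (illustrative only).
 * One thread hammers a counter while another repeatedly reads a
 * read-mostly field.  When both fields sit on one cache line, every
 * write invalidates the reader's copy and both threads slow down.
 * Build: gcc -O2 -pthread false_sharing.c -o false_sharing
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define CACHELINE 64
#define ITERS (200UL * 1000 * 1000)

/* Write-hot counter and read-mostly field forced onto the same cache line. */
struct bad {
	volatile long mode;	/* read-mostly (think f_mode/f_op) */
	atomic_long count;	/* written constantly (think f_count) */
} __attribute__((aligned(CACHELINE)));

/* The same two fields, but each one starts its own cache line. */
struct good {
	volatile long mode __attribute__((aligned(CACHELINE)));
	atomic_long count  __attribute__((aligned(CACHELINE)));
};

static struct bad  b;
static struct good g;

struct fields { volatile long *mode; atomic_long *count; };

static void *writer(void *arg)
{
	struct fields *f = arg;

	for (unsigned long i = 0; i < ITERS; i++)
		atomic_fetch_add_explicit(f->count, 1, memory_order_relaxed);
	return NULL;
}

static void *reader(void *arg)
{
	struct fields *f = arg;
	long sum = 0;

	for (unsigned long i = 0; i < ITERS; i++)
		sum += *f->mode;	/* volatile load touches the line every pass */
	return (void *)sum;
}

static void run(const char *name, struct fields *f)
{
	struct timespec t0, t1;
	pthread_t tw, tr;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pthread_create(&tw, NULL, writer, f);
	pthread_create(&tr, NULL, reader, f);
	pthread_join(tw, NULL);
	pthread_join(tr, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%-22s %.2f s\n", name,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int main(void)
{
	struct fields fb = { &b.mode, &b.count };
	struct fields fg = { &g.mode, &g.count };

	run("shared cache line:", &fb);
	run("separate cache lines:", &fg);
	return 0;
}

On a multi-core machine the "shared cache line" run is typically noticeably slower;
keeping the read-mostly members of struct file together avoids the same kind of
cacheline bouncing on the hot syscall paths.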

To further demonstrate the effectiveness of the updated layout on other code
paths, results for fsdisk, fsbuffer, and fstime are also included in the new
commit message.

Admittedly, the new layout can only reduce false sharing in high-contention
situations. The performance gain is not obvious if there are other bottlenecks.
For instance, if the cores are spread across multiple sockets, memory access may
become the new bottleneck due to NUMA.

Here are the results with cores spread across NUMA nodes. The patch has no
negative effect on performance.

Command: numactl -C 0-3,16-19,63-66,72-75 ./Run -c 16 syscall fstime fsdisk fsbuffer
With Patch
Benchmark Run: Thu Jun 01 2023 03:13:52 - 03:23:15
224 CPUs in system; running 16 parallel copies of tests

File Copy 1024 bufsize 2000 maxblocks       589958.6 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks         148779.2 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks      1968023.8 KBps  (30.0 s, 2 samples)
System Call Overhead                       5804316.1 lps   (10.0 s, 7 samples)

System Benchmarks Partial Index               BASELINE       RESULT    INDEX
File Copy 1024 bufsize 2000 maxblocks           3960.0     589958.6   1489.8
File Copy 256 bufsize 500 maxblocks             1655.0     148779.2    899.0
File Copy 4096 bufsize 8000 maxblocks           5800.0    1968023.8   3393.1
System Call Overhead                           15000.0    5804316.1   3869.5
                                                                    ========
System Benchmarks Index Score (Partial Only)                          2047.8

Without Patch
Benchmark Run: Thu Jun 01 2023 02:11:45 - 02:21:08
224 CPUs in system; running 16 parallel copies of tests

File Copy 1024 bufsize 2000 maxblocks       571829.9 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks         147693.8 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks      1938854.5 KBps  (30.0 s, 2 samples)
System Call Overhead                       5791936.3 lps   (10.0 s, 7 samples)

System Benchmarks Partial Index               BASELINE       RESULT    INDEX
File Copy 1024 bufsize 2000 maxblocks           3960.0     571829.9   1444.0
File Copy 256 bufsize 500 maxblocks             1655.0     147693.8    892.4
File Copy 4096 bufsize 8000 maxblocks           5800.0    1938854.5   3342.9
System Call Overhead                           15000.0    5791936.3   3861.3
                                                                    ========
System Benchmarks Index Score (Partial Only)                          2019.5

> -----Original Message-----
> From: Dave Chinner <david@xxxxxxxxxxxxx>
> Sent: Thursday, June 1, 2023 6:31 AM
> To: Chen, Zhiyin <zhiyin.chen@xxxxxxxxx>
> Cc: Eric Biggers <ebiggers@xxxxxxxxxx>; Christian Brauner
> <brauner@xxxxxxxxxx>; viro@xxxxxxxxxxxxxxxxxx; linux-
> fsdevel@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Zou, Nanhai
> <nanhai.zou@xxxxxxxxx>; Feng, Xiaotian <xiaotian.feng@xxxxxxxxx>
> Subject: Re: [PATCH] fs.h: Optimize file struct to prevent false sharing
>
> On Wed, May 31, 2023 at 10:31:09AM +0000, Chen, Zhiyin wrote:
> > As Eric said, CONFIG_RANDSTRUCT_NONE is set in the default config and
> > some production environments, including Ali Cloud. Therefore, it is
> > worthwhile to optimize the file struct layout.
> >
> > Here are the syscall test results of unixbench.
>
> Results look good, but the devil is in the detail....
>
> > Command: numactl -C 3-18 ./Run -c 16 syscall
>
> So the test is restricted to a set of adjacent cores within a single CPU socket,
> so all the cachelines are typically being shared within a single socket's CPU
> caches. IOWs, the fact there are 224 CPUs in the machine is largely irrelevant
> for this microbenchmark.
>
> i.e. is this a microbenchmark that is going faster simply because the working
> set for the specific benchmark now fits in L2 or L3 cache when it didn't before?
>
> Does this same result occur for different CPUs types, cache sizes and
> architectures? What about when the cores used by the benchmark are
> spread across multiple sockets so the cost of remote cacheline access is taken
> into account? If this is actually a real benefit, then we should see similar or
> even larger gains between CPU cores that are further apart because the cost
> of false cacheline sharing are higher in those systems....
>
> > Without patch
> > ------------------------
> > 224 CPUs in system; running 16 parallel copies of tests
> > System Call Overhead 5611223.7 lps (10.0 s, 7 samples)
> > System Benchmarks Partial Index BASELINE RESULT INDEX
> > System Call Overhead 15000.0 5611223.7 3740.8
> > ========
> > System Benchmarks Index Score (Partial Only) 3740.8
> >
> > With patch
> > ----------------------------------------------------------------------
> > --
> > 224 CPUs in system; running 16 parallel copies of tests
> > System Call Overhead 7567076.6 lps (10.0 s, 7 samples)
> > System Benchmarks Partial Index BASELINE RESULT INDEX
> > System Call Overhead 15000.0 7567076.6 5044.7
> > ========
> > System Benchmarks Index Score (Partial Only) 5044.7
>
> Where is all this CPU time being saved? Do you have a profile showing what
> functions in the kernel are running far more efficiently now?
>
> Yes, the results look good, but if all this change is doing is micro-optimising a
> single code path, it's much less impressive and far more likely that it has no
> impact on real-world performance...
>
> More information, please!
>
> -Dave.
>
> --
> Dave Chinner
> david@xxxxxxxxxxxxx