RE: [PATCH] fs/address_space: add alignment padding for i_mmap and i_mmap_rwsem to mitigate false sharing.

From: Zhu, Lipeng
Date: Sun Jul 16 2023 - 10:48:16 EST


Hi Andrew,

> -----Original Message-----
> From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Sent: Monday, July 3, 2023 5:11 AM
> To: Zhu, Lipeng <lipeng.zhu@xxxxxxxxx>
> Cc: viro@xxxxxxxxxxxxxxxxxx; brauner@xxxxxxxxxx; linux-
> fsdevel@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> Deng, Pan <pan.deng@xxxxxxxxx>; Ma, Yu <yu.ma@xxxxxxxxx>; Li, Tianyou
> <tianyou.li@xxxxxxxxx>; tim.c.chen@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] fs/address_space: add alignment padding for i_mmap and
> i_mmap_rwsem to mitigate false sharing.
>
> On Wed, 28 Jun 2023 18:56:25 +0800 "Zhu, Lipeng" <lipeng.zhu@xxxxxxxxx>
> wrote:
>
> > When running UnixBench/Shell Scripts, we observed high false sharing
> > for accessing i_mmap against i_mmap_rwsem.
> >
> > UnixBench/Shell Scripts is a typical load/execute command test
> > scenario; i_mmap is accessed frequently to insert into and remove
> > from the vma_interval_tree.
> > Meanwhile, the i_mmap_rwsem is frequently loaded. Unfortunately, they
> > are in the same cacheline.
>
> That sounds odd. One would expect these two fields to be used in close
> conjunction, so any sharing might even be beneficial. Can you identify in more
> detail what's actually going on in there?
>

Yes, I'm running UnixBench/Shell Scripts, which concurrently
launches, executes, and exits a lot of shell commands.

While the workload is running:
1: A lot of processes invoke vma_interval_tree_remove, which touches
'i_mmap'. The call stack:
----vma_interval_tree_remove
|----unlink_file_vma
| free_pgtables
| |----exit_mmap
| | mmput
| | |----begin_new_exec
| | | load_elf_binary
| | | bprm_execve

2: Also, a lot of processes touch 'i_mmap_rwsem' to acquire the
semaphore before accessing 'i_mmap'.
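
To make the interaction concrete, here is a simplified sketch of that
unlink path (a paraphrase, not verbatim mm/mmap.c; the function name is
just for the sketch). The write lock on 'i_mmap_rwsem' is taken right
before the interval tree rooted at 'i_mmap' is modified:

/* Simplified sketch of the unlink path, not verbatim kernel code. */
static void unlink_file_vma_sketch(struct vm_area_struct *vma)
{
        struct address_space *mapping;

        if (!vma->vm_file)
                return;
        mapping = vma->vm_file->f_mapping;

        /*
         * down_write() dirties i_mmap_rwsem (count/owner), while waiters
         * on other CPUs spin in rwsem_spin_on_owner() reading the owner
         * field ...
         */
        down_write(&mapping->i_mmap_rwsem);
        /*
         * ... and the tree update then dirties i_mmap, which sits in the
         * same cacheline as the rwsem in the current layout.
         */
        vma_interval_tree_remove(vma, &mapping->i_mmap);
        up_write(&mapping->i_mmap_rwsem);
}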
In the existing 'address_space' layout, 'i_mmap' and 'i_mmap_rwsem' are
in the same cacheline:
struct address_space {
        struct inode *             host;                 /*     0     8 */
        struct xarray              i_pages;              /*     8    16 */
        struct rw_semaphore        invalidate_lock;      /*    24    40 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        gfp_t                      gfp_mask;             /*    64     4 */
        atomic_t                   i_mmap_writable;      /*    68     4 */
        struct rb_root_cached      i_mmap;               /*    72    16 */
        struct rw_semaphore        i_mmap_rwsem;         /*    88    40 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        long unsigned int          nrpages;              /*   128     8 */
        long unsigned int          writeback_index;      /*   136     8 */
        const struct address_space_operations * a_ops;   /*   144     8 */
        long unsigned int          flags;                /*   152     8 */
        errseq_t                   wb_err;               /*   160     4 */
        spinlock_t                 private_lock;         /*   164     4 */
        struct list_head           private_list;         /*   168    16 */
        void *                     private_data;         /*   184     8 */

        /* size: 192, cachelines: 3, members: 15 */
};

The following perf c2c result shows heavy c2c bouncing due to the false sharing:
=================================================
Shared Cache Line Distribution Pareto
=================================================
-------------------------------------------------------------
0 3729 5791 0 0 0xff19b3818445c740
-------------------------------------------------------------
3.27% 3.02% 0.00% 0.00% 0x18 0 1 0xffffffffa194403b 604 483 389 692 203 [k] vma_interval_tree_insert [kernel.kallsyms] vma_interval_tree_insert+75 0 1
4.13% 3.63% 0.00% 0.00% 0x20 0 1 0xffffffffa19440a2 553 413 415 962 215 [k] vma_interval_tree_remove [kernel.kallsyms] vma_interval_tree_remove+18 0 1
2.04% 1.35% 0.00% 0.00% 0x28 0 1 0xffffffffa219a1d6 1210 855 460 1229 222 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+678 0 1
0.62% 1.85% 0.00% 0.00% 0x28 0 1 0xffffffffa219a1bf 762 329 577 527 198 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+655 0 1
0.48% 0.31% 0.00% 0.00% 0x28 0 1 0xffffffffa219a58c 1677 1476 733 1544 224 [k] down_write [kernel.kallsyms] down_write+28 0 1
0.05% 0.07% 0.00% 0.00% 0x28 0 1 0xffffffffa219a21d 1040 819 689 33 27 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+749 0 1
0.00% 0.05% 0.00% 0.00% 0x28 0 1 0xffffffffa17707db 0 1005 786 1373 223 [k] up_write [kernel.kallsyms] up_write+27 0 1
0.00% 0.02% 0.00% 0.00% 0x28 0 1 0xffffffffa219a064 0 233 778 32 30 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+308 0 1
33.82% 34.10% 0.00% 0.00% 0x30 0 1 0xffffffffa1770945 779 495 534 6011 224 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+53 0 1
17.06% 15.28% 0.00% 0.00% 0x30 0 1 0xffffffffa1770915 593 438 468 2715 224 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+5 0 1
3.54% 3.52% 0.00% 0.00% 0x30 0 1 0xffffffffa2199f84 881 601 583 1421 223 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+84 0 1

> > The patch places the i_mmap and i_mmap_rwsem in separate cache lines
> > to avoid this false sharing problem.
> >
> > With this patch, on Intel Sapphire Rapids 2 sockets 112c/224t
> > platform, based on kernel v6.4-rc4, the 224 parallel score is improved
> > ~2.5% for UnixBench/Shell Scripts case. And perf c2c tool shows the
> > false sharing is resolved as expected, the symbol
> > vma_interval_tree_remove disappeared in cache line 0 after this change.
>
> There can be many address_spaces in memory, so a size increase is a concern.
> Is there anything we can do to minimize the cost of this?

Thanks for the reminder about the memory size increase. With the padding, the
struct "address_space" grows from 3 to 4 cachelines, so its size increases by
around 33%.
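
(For illustration, the padding approach is essentially forcing
'i_mmap_rwsem' onto its own cacheline, e.g.:)

struct address_space {
        ...
        struct rb_root_cached   i_mmap;
        /*
         * Illustrative: start i_mmap_rwsem on a fresh cacheline so it no
         * longer shares one with i_mmap; the padding inserted here is
         * what pushes the struct from 3 to 4 cachelines.
         */
        struct rw_semaphore     i_mmap_rwsem ____cacheline_aligned_in_smp;
        unsigned long           nrpages;
        ...
};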
I then tried another approach: moving 'i_mmap_rwsem' below the 'flags' field.
This does not increase the size of "address_space", and based on v6.4.0 on the
same Intel Sapphire Rapids 112C/224T platform, the score improves by ~5.3%.
The perf c2c record data below shows that the false sharing between 'i_mmap'
and 'i_mmap_rwsem' is gone.
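
Concretely, the new field order looks roughly like this (a sketch derived
from the pahole layout above; the exact diff will be in the follow-up patch):

struct address_space {
        struct inode            *host;
        struct xarray           i_pages;
        struct rw_semaphore     invalidate_lock;
        gfp_t                   gfp_mask;
        atomic_t                i_mmap_writable;
        struct rb_root_cached   i_mmap;
        unsigned long           nrpages;
        unsigned long           writeback_index;
        const struct address_space_operations *a_ops;
        unsigned long           flags;
        struct rw_semaphore     i_mmap_rwsem;   /* moved down, below 'flags' */
        errseq_t                wb_err;
        spinlock_t              private_lock;
        struct list_head        private_list;
        void                    *private_data;
};

Since only the order changes, the struct stays within the existing
192 bytes / 3 cachelines.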

=================================================
Shared Cache Line Distribution Pareto
=================================================
-------------------------------------------------------------
0 556 838 0 0 0xff2780d7965d2780
-------------------------------------------------------------
0.18% 0.60% 0.00% 0.00% 0x8 0 1 0xffffffffafff27b8 503 453 569 14 13 [k] do_dentry_open [kernel.kallsyms] do_dentry_open+456 0 1
0.54% 0.12% 0.00% 0.00% 0x8 0 1 0xffffffffaffc51ac 510 199 428 15 12 [k] hugepage_vma_check [kernel.kallsyms] hugepage_vma_check+252 0 1
1.80% 2.15% 0.00% 0.00% 0x18 0 1 0xffffffffb079a1d6 1778 799 343 215 136 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+678 0 1
0.54% 1.31% 0.00% 0.00% 0x18 0 1 0xffffffffb079a1bf 547 296 528 91 71 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+655 0 1
0.72% 0.72% 0.00% 0.00% 0x18 0 1 0xffffffffb079a58c 1479 1534 676 288 163 [k] down_write [kernel.kallsyms] down_write+28 0 1
0.00% 0.12% 0.00% 0.00% 0x18 0 1 0xffffffffafd707db 0 2381 744 282 158 [k] up_write [kernel.kallsyms] up_write+27 0 1
0.00% 0.12% 0.00% 0.00% 0x18 0 1 0xffffffffb079a064 0 239 518 6 6 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+308 0 1
46.58% 47.02% 0.00% 0.00% 0x20 0 1 0xffffffffafd70945 704 403 499 1137 219 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+53 0 1
23.92% 25.78% 0.00% 0.00% 0x20 0 1 0xffffffffafd70915 558 413 500 542 185 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+5 0 1

I will send another patch out and update the commit log.

Best Regards,
Lipeng Zhu