Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl

From: Dave Chinner
Date: Wed May 10 2023 - 00:46:16 EST


On Tue, May 09, 2023 at 12:56:47PM -0400, Kent Overstreet wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Because scalability of the global inode_hash_lock really, really
> sucks.
>
> 32-way concurrent create on a couple of different filesystems
> before:
>
> - 52.13% 0.04% [kernel] [k] ext4_create
> - 52.09% ext4_create
> - 41.03% __ext4_new_inode
> - 29.92% insert_inode_locked
> - 25.35% _raw_spin_lock
> - do_raw_spin_lock
> - 24.97% __pv_queued_spin_lock_slowpath
>
> - 72.33% 0.02% [kernel] [k] do_filp_open
> - 72.31% do_filp_open
> - 72.28% path_openat
> - 57.03% bch2_create
> - 56.46% __bch2_create
> - 40.43% inode_insert5
> - 36.07% _raw_spin_lock
> - do_raw_spin_lock
> 35.86% __pv_queued_spin_lock_slowpath
> 4.02% find_inode
>
> Convert the inode hash table to a RCU-aware hash-bl table just like
> the dentry cache. Note that we need to store a pointer to the
> hlist_bl_head the inode has been added to in the inode so that when
> it comes to unhash the inode we know what list to lock. We need to
> do this because the hash value that is used to hash the inode is
> generated from the inode itself - filesystems can provide this
> themselves so we have to either store the hash or the head pointer
> in the inode to be able to find the right list head for removal...
>
> Same workload after:
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: Christian Brauner <brauner@xxxxxxxxxx>
> Cc: linux-fsdevel@xxxxxxxxxxxxxxx
> Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>

I have been keeping this patchset up to date in my own local trees,
and the code in this patch looks the same. The commit message above,
however, has been mangled. The full commit message should be:

vfs: inode cache conversion to hash-bl

Because scalability of the global inode_hash_lock really, really
sucks and prevents me from doing scalability characterisation and
analysis of bcachefs algorithms.

Profiles of a 32-way concurrent create of 51.2m inodes with fsmark
on a couple of different filesystems on a 5.10 kernel:

- 52.13% 0.04% [kernel] [k] ext4_create
- 52.09% ext4_create
- 41.03% __ext4_new_inode
- 29.92% insert_inode_locked
- 25.35% _raw_spin_lock
- do_raw_spin_lock
- 24.97% __pv_queued_spin_lock_slowpath


- 72.33% 0.02% [kernel] [k] do_filp_open
- 72.31% do_filp_open
- 72.28% path_openat
- 57.03% bch2_create
- 56.46% __bch2_create
- 40.43% inode_insert5
- 36.07% _raw_spin_lock
- do_raw_spin_lock
35.86% __pv_queued_spin_lock_slowpath
4.02% find_inode

btrfs was tested but it is limited by internal lock contention at
>=2 threads on this workload, so it never hammers the inode cache
lock hard enough for this change to matter to its performance.

However, both bcachefs and ext4 demonstrate poor scaling at >=8
threads on concurrent lookup or create workloads.

Hence convert the inode hash table to a RCU-aware hash-bl table just
like the dentry cache. Note that we need to store a pointer to the
hlist_bl_head the inode has been added to in the inode so that when
it comes to unhash the inode we know what list to lock. We need to
do this because, unlike the dentry cache, the hash value used to
hash the inode is not generated from the inode itself; i.e.
filesystems can provide it themselves, so we have to store either
the hashval or the hlist head pointer in the inode to be able to
find the right list head for removal...

Concurrent create with varying thread count (files/s):

                  ext4                    bcachefs
threads      vanilla  patched        vanilla  patched
   2          117k     112k            80k      85k
   4          185k     190k           133k     145k
   8          303k     346k           185k     255k
  16          389k     465k           190k     420k
  32          360k     437k           142k     481k

CPU usage for both bcachefs and ext4 at 16 and 32 threads has been
halved on the patched kernel, while performance has increased
marginally on ext4 and massively on bcachefs. Internal filesystem
algorithms now limit performance on these workloads, not the global
inode_hash_lock.

Profile of the workloads on the patched kernels:

- 35.94% 0.07% [kernel] [k] ext4_create
- 35.87% ext4_create
- 20.45% __ext4_new_inode
...
3.36% insert_inode_locked

- 78.43% do_filp_open
- 78.36% path_openat
- 53.95% bch2_create
- 47.99% __bch2_create
....
- 7.57% inode_insert5
6.94% find_inode

Spinlock contention is largely gone from the inode hash operations
and the filesystems are limited by contention in their internal
algorithms.

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
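
For anyone not familiar with the hash-bl primitives, here is a minimal
sketch of the shape of the conversion the commit message describes.
The hlist_bl_* calls are the real primitives from
include/linux/list_bl.h and include/linux/rculist_bl.h; the struct,
field and function names (inode_sketch, i_hash_head, i_key, sketch_*)
are illustrative stand-ins, not the identifiers the patch actually
uses:

#include <linux/list_bl.h>
#include <linux/rculist_bl.h>
#include <linux/rcupdate.h>

/*
 * Each hlist_bl_head is a single pointer whose low bit doubles as a
 * bit spinlock, so inserts and removals serialise per hash chain
 * instead of on one global inode_hash_lock.
 */
struct inode_sketch {
        struct hlist_bl_node    i_hash;         /* chain linkage */
        struct hlist_bl_head    *i_hash_head;   /* bucket we were added to */
        unsigned long           i_key;          /* illustrative fs-supplied key */
};

static void sketch_insert(struct inode_sketch *inode, struct hlist_bl_head *b)
{
        hlist_bl_lock(b);                       /* bit lock on this chain only */
        hlist_bl_add_head_rcu(&inode->i_hash, b);
        inode->i_hash_head = b;                 /* remember bucket for unhash */
        hlist_bl_unlock(b);
}

static void sketch_remove(struct inode_sketch *inode)
{
        /*
         * The hash may be built from fs-supplied keys, so it cannot be
         * recomputed from the inode alone; unhash via the stored head.
         */
        struct hlist_bl_head *b = inode->i_hash_head;

        hlist_bl_lock(b);
        hlist_bl_del_init_rcu(&inode->i_hash);
        inode->i_hash_head = NULL;
        hlist_bl_unlock(b);
}

static struct inode_sketch *sketch_find(struct hlist_bl_head *b,
                                        unsigned long key)
{
        struct inode_sketch *inode;
        struct hlist_bl_node *node;

        /* Lookups walk the chain under RCU without taking any lock. */
        rcu_read_lock();
        hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
                if (inode->i_key == key) {
                        /*
                         * The real find_inode() revalidates i_state and
                         * takes a reference here; elided in this sketch.
                         */
                        rcu_read_unlock();
                        return inode;
                }
        }
        rcu_read_unlock();
        return NULL;
}

That is why the modification paths only ever contend on a single hash
chain, and the lookup paths don't contend at all.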

Other than the commit message, the diffstat is the same and I don't
see any obvious differences in the code compared to what I've been
running locally.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx