[PATCH v6 0/7] kernfs: Introduce hashed mutexes to replace global kernfs_open_file_mutex.

From: Imran Khan
Date: Mon Feb 14 2022 - 07:03:48 EST


Reduce contention around global locks used in kernfs.

The current kernfs design makes use of 3 global locks to synchronize
various operations. There are a few other global locks as well but
they are used for specific cases and hence don't cause severe contention.

The above mentioned 3 main global locks used in kernfs are:

1. A global mutex, kernfs_open_file_mutex, to protect the list of
kernfs_open_file instances correspondng to a sysfs attribute.

2. A global spinlock, kernfs_open_node_lock, to protect
kernfs_node->attr.open which points to kernfs_open_node instance
corresponding to a kernfs_node.

3. A global per-fs rw semaphore, root->kernfs_rwsem, to synchronize most
of the other operations like adding, removing, renaming etc. of a file
or directory.

Since these locks are global, they can cause contention when multiple
(for example few hundred) applications try to access sysfs (or other kernfs
based file system) files in parallel, even if the applications are
accessing different and unrelated files.

For example on a system with 384 cores, if I run 200 instances of an
application which is mostly executing the following loop:

for (int loop = 0; loop <100 ; loop++)
{
for (int port_num = 1; port_num < 2; port_num++)
{
for (int gid_index = 0; gid_index < 254; gid_index++ )
{
char ret_buf[64], ret_buf_lo[64];
char gid_file_path[1024];

int ret_len;
int ret_fd;
ssize_t ret_rd;

ub4 i, saved_errno;

memset(ret_buf, 0, sizeof(ret_buf));
memset(gid_file_path, 0, sizeof(gid_file_path));

ret_len = snprintf(gid_file_path, sizeof(gid_file_path),
"/sys/class/infiniband/%s/ports/%d/gids/%d",
dev_name,
port_num,
gid_index);

ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
if (ret_fd < 0)
{
printf("Failed to open %s\n", gid_file_path);
continue;
}

/* Read the GID */
ret_rd = read(ret_fd, ret_buf, 40);

if (ret_rd == -1)
{
printf("Failed to read from file %s, errno: %u\n",
gid_file_path, saved_errno);

continue;
}

close(ret_fd);
}
}

I can see contention around above mentioned locks as follows:

- 54.07% 53.60% showgids [kernel.kallsyms] [k] osq_lock
- 53.60% __libc_start_main
- 32.29% __GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
path_openat
vfs_open
do_dentry_open
kernfs_fop_open
mutex_lock
- __mutex_lock_slowpath
- 32.23% __mutex_lock.isra.5
osq_lock
- 21.31% __GI___libc_close
entry_SYSCALL_64_after_hwframe
do_syscall_64
exit_to_usermode_loop
task_work_run
____fput
__fput
kernfs_fop_release
kernfs_put_open_node.isra.8
mutex_lock
- __mutex_lock_slowpath
- 21.28% __mutex_lock.isra.5
osq_lock

- 10.49% 10.39% showgids [kernel.kallsyms] [k] down_read
10.39% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 9.72% link_path_walk
- 5.21% inode_permission
- __inode_permission
- 5.21% kernfs_iop_permission
down_read
- 4.08% walk_component
lookup_fast
- d_revalidate.part.24
- 4.08% kernfs_dop_revalidate

- 7.48% 7.41% showgids [kernel.kallsyms] [k] up_read
7.41% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 7.01% link_path_walk
- 4.12% inode_permission
- __inode_permission
- 4.12% kernfs_iop_permission
up_read
- 2.61% walk_component
lookup_fast
- d_revalidate.part.24
- 2.61% kernfs_dop_revalidate

Moreover this run of 200 application isntances takes 32-34 secs. to
complete.

This patch set is reducing the above mentioned contention by replacing
these global locks with hashed locks.

For example with the patched kernel and on the same test setup, we no
longer see contention around osq_lock (i.e kernfs_open_file_mutex) and also
contention around per-fs kernfs_rwsem has reduced significantly as well.
This can be seen in the following perf snippet:

- 1.66% 1.65% showgids [kernel.kallsyms] [k] down_read
1.65% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.62% link_path_walk
- 0.98% inode_permission
- __inode_permission
+ 0.98% kernfs_iop_permission
- 0.52% walk_component
lookup_fast
- d_revalidate.part.24
- 0.52% kernfs_dop_revalidate

- 1.12% 1.11% showgids [kernel.kallsyms] [k] up_read
1.11% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.11% link_path_walk
- 0.69% inode_permission
- __inode_permission
- 0.69% kernfs_iop_permission
up_read

Moreover the test execution time has reduced from 32-34 secs to 18-19 secs.

The patches of this patchset introduce following changes:

PATCH-1: Introduce hashed mutexes to replace global kernfs_open_file_mutex.

PATCH-2: Replace global kernfs_open_file_mutex with hashed mutexes.

PATCH-3: Introduce hashed spinlocks to replace global kernfs_open_node_lock.

PATCH-4: Replace global kernfs_open_node_lock with hashed spinlocks.

PATCH-5: Use a per-fs rwsem to protect list of kernfs_super_info.

PATCH-6: Introduce hashed rw-sem to replace per-fs kernfs_rwsem.

PATCH-7: Replace per-fs rwsem with hashed ones.

------------------------------------------------------------------
Changes since v5:
- Addressed review comments from Greg
- Clean up duplicate code.
- Addressed review comments from Tejun
- Explain how current value of NR_KERNFS_LOCKS were obtained for
different values of NR_CPUS.
- Use single hash table for locks in place of per-fs hash table
for locks.
- Move introduction of supers_rwsem to a separate patch.
- Separate interface and usage part of hashed rwsem implementation
into 2 patches.
- Use address of rwsems to determine locking order in case of
nested locking. This approach avoids ABBA deadlock possibility.
- Change some #define macros into enum, with proper prefix.
- Taking a cue from Tejun's feedback about separating hashed rwsem
interface and usage into 2 patches, I have also divided the patch
that introduced hashed mutex and spinlock, into separate patches.
- Rebase on linux-next tag: next-20220211

Changes since v4:
- Removed compilation warnings reported by the 0-day bot.

Changes since v3:
- Make open_file_mutex and open_node_lock per-fs.
- Replace per-fs rwsem with per-fs hashed rwsem.
(Suggested by Tejun Heo <tj@xxxxxxxxxx>)

Imran Khan (7):
kernfs: Introduce hashed mutexes to replace global
kernfs_open_file_mutex.
kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.
kernfs: Introduce hashed spinlocks to replace global
kernfs_open_node_lock.
kernfs: Replace global kernfs_open_node_lock with hashed spinlocks.
kernfs: Use a per-fs rwsem to protect per-fs list of
kernfs_super_info.
kernfs: Introduce hashed rw-sem to replace per-fs kernfs_rwsem.
kernfs: Replace per-fs rwsem with hashed ones.

fs/kernfs/dir.c | 132 ++++----
fs/kernfs/file.c | 65 ++--
fs/kernfs/inode.c | 22 +-
fs/kernfs/kernfs-internal.h | 604 +++++++++++++++++++++++++++++++++++-
fs/kernfs/mount.c | 29 +-
fs/kernfs/symlink.c | 5 +-
include/linux/kernfs.h | 69 ++++
7 files changed, 802 insertions(+), 124 deletions(-)


base-commit: 6d9bd4ad4ca08b1114e814c2c42383b8b13be631
--
2.30.2