Re: (subset) [PATCH 22/32] vfs: inode cache conversion to hash-bl

From: Dave Chinner
Date: Fri Oct 20 2023 - 07:39:57 EST


On Thu, Oct 19, 2023 at 05:59:58PM +0200, Mateusz Guzik wrote:
> On Thu, Oct 19, 2023 at 05:30:40PM +0200, Mateusz Guzik wrote:
> > On Tue, May 23, 2023 at 11:28:38AM +0200, Christian Brauner wrote:
> > > On Tue, 09 May 2023 12:56:47 -0400, Kent Overstreet wrote:
> > > > Because scalability of the global inode_hash_lock really, really
> > > > sucks.
> > > >
> > > > 32-way concurrent create on a couple of different filesystems
> > > > before:
> > > >
> > > > - 52.13% 0.04% [kernel] [k] ext4_create
> > > > - 52.09% ext4_create
> > > > - 41.03% __ext4_new_inode
> > > > - 29.92% insert_inode_locked
> > > > - 25.35% _raw_spin_lock
> > > > - do_raw_spin_lock
> > > > - 24.97% __pv_queued_spin_lock_slowpath
> > > >
> > > > [...]
> > >
> > > This is interesting completely independent of bcachefs so we should give
> > > it some testing.
> > >
> > > I updated a few places that had outdated comments.
> > >
> > > ---
> > >
> > > Applied to the vfs.unstable.inode-hash branch of the vfs/vfs.git tree.
> > > Patches in the vfs.unstable.inode-hash branch should appear in linux-next soon.
> > >
> > > Please report any outstanding bugs that were missed during review in a
> > > new review to the original patch series allowing us to drop it.
> > >
> > > It's encouraged to provide Acked-bys and Reviewed-bys even though the
> > > patch has now been applied. If possible patch trailers will be updated.
> > >
> > > tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> > > branch: vfs.unstable.inode-hash
> > >
> > > [22/32] vfs: inode cache conversion to hash-bl
> > > https://git.kernel.org/vfs/vfs/c/e3e92d47e6b1
> >
> > What, if anything, is blocking this? It is over 5 months now, I don't
> > see it in master nor -next.

Not having a test machine that can validate my current vfs-scale
patchset for 4 of the 5 months makes it hard to measure and
demonstrate the efficacy of the changes on a current kernel....

> > To be clear there is no urgency as far as I'm concerned, but I did run
> > into something which is primarily bottlenecked by inode hash lock and
> > looks like the above should sort it out.
> >
> > Looks like the patch was simply forgotten.
> >
> > tl;dr can this land in -next please
>
> In case you can't be arsed, here is something funny which may convince
> you to expedite. ;)
>
> I did some benching by running 20 processes in parallel, each doing stat
> on a tree of 1 million files (one tree per proc, 1000 dirs x 1000 files,
> so 20 mln inodes in total). Box had 24 cores and 24G RAM.
>
> Best times:
> Linux: 7.60s user 1306.90s system 1863% cpu 1:10.55 total
> FreeBSD: 3.49s user 345.12s system 1983% cpu 17.573 total
> OpenBSD: 5.01s user 6463.66s system 2000% cpu 5:23.42 total
> DragonflyBSD: 11.73s user 1316.76s system 1023% cpu 2:09.78 total
> OmniosCE: 9.17s user 516.53s system 1550% cpu 33.905 total
>
> NetBSD failed to complete the run, OOM-killing workers:
> http://mail-index.netbsd.org/tech-kern/2023/10/19/msg029242.html
> OpenBSD is shafted by a big kernel lock, so no surprise it takes a long
> time.
>
> So what I find funny is that Linux needed more time than OmniosCE (an
> Illumos variant, fork of Solaris).
>
> It also needed more time than FreeBSD, which is not necessarily funny
> but not that great either.
>
> All systems were mostly busy contending on locks and in particular Linux
> was almost exclusively busy waiting on inode hash lock.

Did you bother to test the patch, or are you just complaining
that nobody has already done the work for you?

Because if you tested the patch, you'd have realised that by itself
it does nothing to improve performance of the concurrent find+stat
workload. The lock contention simply moves to the sb_inode_list_lock
instead.

Patches to get rid of the sb_inode_list_lock contention were
written by smarter people than me long before I wrote the hash-bl
patches. However, the problem the dlist stuff was written to address
(sync iterating all inodes causing sb inode list lock contention
when we have a hundred million cached inodes in memory) was better
fixed by changing the sync algorithm not to iterate all cached
inodes just to find the handful of dirty ones it needed to sync.

IOWs, those sb_inode_list_lock changes haven't been included for the
same reason as the hash-bl patches: outside micro-benchmarks, these
locks just don't show up in profiles on production machines.
Hence there's no urgency to "fix" these lock contention
problems despite the ease with which micro-benchmarks can reproduce
it...

I've kept the patches current for years, even though there hasn't
been a pressing need for them. The last "vfs-scale" version I did
some validation on is here:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/log/?h=vfs-scale

5.17 was the last kernel I did any serious validation and
measurement against, and that all needs to be repeated before
proposing it for inclusion because lots of stuff has changed since I
last did some serious multi-filesystem a/b testing of this code....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx