Re: [PATCH] mm: zswap: multiple zpool support

From: Johannes Weiner
Date: Fri Jun 02 2023 - 12:49:52 EST


On Thu, Jun 01, 2023 at 10:51:39AM -0700, Yosry Ahmed wrote:
> On Thu, Jun 1, 2023 at 8:58 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > On Wed, May 31, 2023 at 02:29:11AM +0000, Yosry Ahmed wrote:
> > > +config ZSWAP_NR_ZPOOLS_ORDER
> > > + int "Number of zpools in zswap, as power of 2"
> > > + default 0
> > > + depends on ZSWAP
> > > + help
> > > + This options determines the number of zpools to use for zswap, it
> > > + will be 1 << CONFIG_ZSWAP_NR_ZPOOLS_ORDER.
> > > +
> > > + Having multiple zpools helps with concurrency and lock contention
> > > + on the swap in and swap out paths, but uses a little bit of extra
> > > + space.
> >
> > This is nearly impossible for a user, let alone a distribution, to
> > answer adequately.
> >
> > The optimal value needs to be found empirically. And it varies heavily
> > not just by workload but by implementation changes. If we make changes
> > to the lock holding time or redo the data structures, a previously
> > chosen value might no longer be a net positive, and even be harmful.
>
> Yeah, I agree that this can only be tuned empirically, but there is a
> real benefit to tuning it, at least in our experience. I imagined
> having the config option with a default of 0 gives those who want to
> tune it the option, while not messing with users that do not care.

Realistically, how many users besides you will be able to do this
tuning and benefit from this?

> > Architecturally, the pool scoping and the interaction with zswap_tree
> > is currently a mess. We're aware of lifetime bugs, where swapoff kills
> > the tree but the pool still exists with entries making dead references
> > e.g. We likely need to rearchitect this in the near future - maybe tie
> > pools to trees to begin with.
> >
> > (I'm assuming you're already using multiple swap files to avoid tree
> > lock contention, because it has the same scope as the pool otherwise.)
> >
> > Such changes quickly invalidate any settings the user or distro might
> > make here. Exposing the implementation detail of the pools might even
> > get in the way of fixing bugs and cleaning up the architecture.
>
> I was under the impression that config options are not very stable.
> IOW, if we fix the lock contention on an architectural level, and
> there is no more benefit to tuning the number of zpools per zswap
> pool, we can remove the config option. Is this incorrect?

The problem is that it complicates the code for everybody, for the
benefit of a likely very small set of users - who have now opted out
of any need for making the code better.

And we have to keep this, and work around it, until things work for
thosee users who have no incentive to address the underlying issues.

That's difficult to justify.

> > I don't think this patch is ready for primetime. A user build time
> > setting is not appropriate for an optimization that is heavily tied to
> > implementation details and workload dynamics.
>
> What would you suggest instead? We do find value in having multiple
> zpools, at least for the current architecture.

I think it makes sense to look closer at the lock contention, and
whether this can be improved more generically.

zbud and z3fold don't have a lock in the ->map callback that heavily
shows in the profile in your changelog. Is this zsmalloc specific?

If so, looking more closely at zsmalloc locking makes more sense. Is a
lockless implementation feasible? Can we do rw-locking against
zs_compact() to allow concurrency in zs_map_object()?

This would benefit everybody, even zram users.

Again, what about the zswap_tree.lock and swap_info_struct.lock?
They're the same scope unless you use multiple swap files. Would it
make sense to tie pools to trees, so that using multiple swapfiles for
concurrency purposes also implies this optimization?