Re: [PATCH] mm: zswap: multiple zpool support

From: Yosry Ahmed
Date: Fri Jun 02 2023 - 13:00:06 EST


On Fri, Jun 2, 2023 at 9:49 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Thu, Jun 01, 2023 at 10:51:39AM -0700, Yosry Ahmed wrote:
> > On Thu, Jun 1, 2023 at 8:58 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > On Wed, May 31, 2023 at 02:29:11AM +0000, Yosry Ahmed wrote:
> > > > +config ZSWAP_NR_ZPOOLS_ORDER
> > > > + int "Number of zpools in zswap, as power of 2"
> > > > + default 0
> > > > + depends on ZSWAP
> > > > + help
> > > > + This options determines the number of zpools to use for zswap, it
> > > > + will be 1 << CONFIG_ZSWAP_NR_ZPOOLS_ORDER.
> > > > +
> > > > + Having multiple zpools helps with concurrency and lock contention
> > > > + on the swap in and swap out paths, but uses a little bit of extra
> > > > + space.
> > >
> > > This is nearly impossible for a user, let alone a distribution, to
> > > answer adequately.
> > >
> > > The optimal value needs to be found empirically. And it varies heavily
> > > not just by workload but by implementation changes. If we make changes
> > > to the lock holding time or redo the data structures, a previously
> > > chosen value might no longer be a net positive, and even be harmful.
> >
> > Yeah, I agree that this can only be tuned empirically, but there is a
> > real benefit to tuning it, at least in our experience. I imagined
> > having the config option with a default of 0 gives those who want to
> > tune it the option, while not messing with users that do not care.
>
> Realistically, how many users besides you will be able to do this
> tuning and benefit from this?

It's not difficult to try increasing the number of zpools and observe
reduction in lock contention vs. increase in cost for updating total
pool size. Updating the pool size is lockless and only involves an
atomic_read(), so I imagine it won't start showing up as a degradation
until the number of zpools is significant.

>
> > > Architecturally, the pool scoping and the interaction with zswap_tree
> > > is currently a mess. We're aware of lifetime bugs, where swapoff kills
> > > the tree but the pool still exists with entries making dead references
> > > e.g. We likely need to rearchitect this in the near future - maybe tie
> > > pools to trees to begin with.
> > >
> > > (I'm assuming you're already using multiple swap files to avoid tree
> > > lock contention, because it has the same scope as the pool otherwise.)
> > >
> > > Such changes quickly invalidate any settings the user or distro might
> > > make here. Exposing the implementation detail of the pools might even
> > > get in the way of fixing bugs and cleaning up the architecture.
> >
> > I was under the impression that config options are not very stable.
> > IOW, if we fix the lock contention on an architectural level, and
> > there is no more benefit to tuning the number of zpools per zswap
> > pool, we can remove the config option. Is this incorrect?
>
> The problem is that it complicates the code for everybody, for the
> benefit of a likely very small set of users - who have now opted out
> of any need for making the code better.
>
> And we have to keep this, and work around it, until things work for
> thosee users who have no incentive to address the underlying issues.
>
> That's difficult to justify.

I agree that it makes the code less digestible, but is it unbearably
terrible? I think it's arguably within reason.

>
> > > I don't think this patch is ready for primetime. A user build time
> > > setting is not appropriate for an optimization that is heavily tied to
> > > implementation details and workload dynamics.
> >
> > What would you suggest instead? We do find value in having multiple
> > zpools, at least for the current architecture.
>
> I think it makes sense to look closer at the lock contention, and
> whether this can be improved more generically.
>
> zbud and z3fold don't have a lock in the ->map callback that heavily
> shows in the profile in your changelog. Is this zsmalloc specific?

We have observed this with zsmalloc, yes.

>
> If so, looking more closely at zsmalloc locking makes more sense. Is a
> lockless implementation feasible? Can we do rw-locking against
> zs_compact() to allow concurrency in zs_map_object()?

+Sergey Senozhatsky +Minchan Kim

I am not really sure, but it's not just zs_map_object(), it's also
zs_malloc() and I suspect other functions as well.

>
> This would benefit everybody, even zram users.
>
> Again, what about the zswap_tree.lock and swap_info_struct.lock?
> They're the same scope unless you use multiple swap files. Would it
> make sense to tie pools to trees, so that using multiple swapfiles for
> concurrency purposes also implies this optimization?

Yeah, using multiple swapfiles helps with those locks, but it doesn't
help with the zpool lock.

I am reluctant to take this path because I am trying to get rid of
zswap's dependency on swapfiles to begin with, and have it act as its
own standalone swapping backend. If I am successful, then having one
zpool per zswap_tree is just a temporary fix.