Re: [PATCH] mm: zswap: multiple zpool support

From: Yosry Ahmed
Date: Thu Jun 01 2023 - 13:52:23 EST


On Thu, Jun 1, 2023 at 8:58 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Wed, May 31, 2023 at 02:29:11AM +0000, Yosry Ahmed wrote:
> > Support using multiple zpools of the same type in zswap, for concurrency
> > purposes. Add CONFIG_ZSWAP_NR_ZPOOLS_ORDER to control the number of
> > zpools. The order is specific by the config rather than the absolute
> > number to guarantee a power of 2. This is useful so that we can use
> > deterministically link each entry to a zpool by hashing the zswap_entry
> > pointer.
> >
> > On a setup with zswap and zsmalloc, comparing a single zpool (current
> > default) to 32 zpools (by setting CONFIG_ZSWAP_NR_ZPOOLS_ORDER=32) shows
> > improvements in the zsmalloc lock contention, especially on the swap out
> > path.
> >
> > The following shows the perf analysis of the swapout path when 10
> > workloads are simulatenously reclaiming and refaulting tmpfs pages.
> > There are some improvements on the swapin path as well, but much less
> > significant.
> >
> > 1 zpool:
> >
> > |--28.99%--zswap_frontswap_store
> > | |
> > <snip>
> > | |
> > | |--8.98%--zpool_map_handle
> > | | |
> > | | --8.98%--zs_zpool_map
> > | | |
> > | | --8.95%--zs_map_object
> > | | |
> > | | --8.38%--_raw_spin_lock
> > | | |
> > | | --7.39%--queued_spin_lock_slowpath
> > | |
> > | |--8.82%--zpool_malloc
> > | | |
> > | | --8.82%--zs_zpool_malloc
> > | | |
> > | | --8.80%--zs_malloc
> > | | |
> > | | |--7.21%--_raw_spin_lock
> > | | | |
> > | | | --6.81%--queued_spin_lock_slowpath
> > <snip>
> >
> > 32 zpools:
> >
> > |--16.73%--zswap_frontswap_store
> > | |
> > <snip>
> > | |
> > | |--1.81%--zpool_malloc
> > | | |
> > | | --1.81%--zs_zpool_malloc
> > | | |
> > | | --1.79%--zs_malloc
> > | | |
> > | | --0.73%--obj_malloc
> > | |
> > | |--1.06%--zswap_update_total_size
> > | |
> > | |--0.59%--zpool_map_handle
> > | | |
> > | | --0.59%--zs_zpool_map
> > | | |
> > | | --0.57%--zs_map_object
> > | | |
> > | | --0.51%--_raw_spin_lock
> > <snip>
> >
> > Suggested-by: Yu Zhao <yuzhao@xxxxxxxxxx>
> > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > ---
> > mm/Kconfig | 12 +++++++
> > mm/zswap.c | 95 ++++++++++++++++++++++++++++++++++++------------------
> > 2 files changed, 76 insertions(+), 31 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 92c30879bf67..de1da56d2c07 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,18 @@ config ZSWAP_EXCLUSIVE_LOADS
> > The cost is that if the page was never dirtied and needs to be
> > swapped out again, it will be re-compressed.
> >
> > +config ZSWAP_NR_ZPOOLS_ORDER
> > + int "Number of zpools in zswap, as power of 2"
> > + default 0
> > + depends on ZSWAP
> > + help
> > + This options determines the number of zpools to use for zswap, it
> > + will be 1 << CONFIG_ZSWAP_NR_ZPOOLS_ORDER.
> > +
> > + Having multiple zpools helps with concurrency and lock contention
> > + on the swap in and swap out paths, but uses a little bit of extra
> > + space.
>
> This is nearly impossible for a user, let alone a distribution, to
> answer adequately.
>
> The optimal value needs to be found empirically. And it varies heavily
> not just by workload but by implementation changes. If we make changes
> to the lock holding time or redo the data structures, a previously
> chosen value might no longer be a net positive, and even be harmful.

Yeah, I agree that this can only be tuned empirically, but there is a
real benefit to tuning it, at least in our experience. I imagined
having the config option with a default of 0 gives those who want to
tune it the option, while not messing with users that do not care.

>
> Architecturally, the pool scoping and the interaction with zswap_tree
> is currently a mess. We're aware of lifetime bugs, where swapoff kills
> the tree but the pool still exists with entries making dead references
> e.g. We likely need to rearchitect this in the near future - maybe tie
> pools to trees to begin with.
>
> (I'm assuming you're already using multiple swap files to avoid tree
> lock contention, because it has the same scope as the pool otherwise.)
>
> Such changes quickly invalidate any settings the user or distro might
> make here. Exposing the implementation detail of the pools might even
> get in the way of fixing bugs and cleaning up the architecture.

I was under the impression that config options are not very stable.
IOW, if we fix the lock contention on an architectural level, and
there is no more benefit to tuning the number of zpools per zswap
pool, we can remove the config option. Is this incorrect?

>
> > @@ -263,11 +266,13 @@ static void zswap_update_total_size(void)
> > {
> > struct zswap_pool *pool;
> > u64 total = 0;
> > + int i;
> >
> > rcu_read_lock();
> >
> > list_for_each_entry_rcu(pool, &zswap_pools, list)
> > - total += zpool_get_total_size(pool->zpool);
> > + for (i = 0; i < zswap_nr_zpools; i++)
> > + total += zpool_get_total_size(pool->zpools[i]);
>
> This adds a O(nr_pools) operation to every load and store. It's easy
> for this to dominate or outweigh locking costs as workload concurrency
> changes, or data structures and locking change inside the kernel.

Right, which is why this is empirically tuned. In the perf analysis in
the commit log, the lock contention gains far outweigh this cost.
FWIW, the cost here is constant, we just iterate the pools and read a
value.

>
> > @@ -587,14 +603,17 @@ static void shrink_worker(struct work_struct *w)
> > {
> > struct zswap_pool *pool = container_of(w, typeof(*pool),
> > shrink_work);
> > + int i;
> >
> > - if (zpool_shrink(pool->zpool, 1, NULL))
> > - zswap_reject_reclaim_fail++;
> > + for (i = 0; i < zswap_nr_zpools; i++)
> > + if (zpool_shrink(pool->zpools[i], 1, NULL))
> > + zswap_reject_reclaim_fail++;
> > zswap_pool_put(pool);
>
> This scales reclaim batch size by the number of zpools, which can lead
> to varying degrees of overreclaim especially on small zswap sizes with
> high pool concurrency.

I was under the assumption that with Domenico's patch we will mostly
be reclaiming multiple pages anyway, but I can certainly remove this
part and only reclaim one page at a time from one zpool. We can select
one at random or round robin through the zpools.

>
> I don't think this patch is ready for primetime. A user build time
> setting is not appropriate for an optimization that is heavily tied to
> implementation details and workload dynamics.

What would you suggest instead? We do find value in having multiple
zpools, at least for the current architecture.

An internal implementation that we have exposes this as a module
parameter instead, but that is more complicated (I image), because you
need to set it before enabling zswap, or before changing the zswap
pool, otherwise changing it is nop because the zpool(s) is already
allocated. I am also guessing module params are more stable than
config options. Hence, I thought a config option might be more
appropriate.