Re: [PATCH] mm: zswap: multiple zpool support

From: Johannes Weiner
Date: Thu Jun 01 2023 - 11:58:36 EST


On Wed, May 31, 2023 at 02:29:11AM +0000, Yosry Ahmed wrote:
> Support using multiple zpools of the same type in zswap, for concurrency
> purposes. Add CONFIG_ZSWAP_NR_ZPOOLS_ORDER to control the number of
> zpools. The order is specific by the config rather than the absolute
> number to guarantee a power of 2. This is useful so that we can use
> deterministically link each entry to a zpool by hashing the zswap_entry
> pointer.
>
> On a setup with zswap and zsmalloc, comparing a single zpool (current
> default) to 32 zpools (by setting CONFIG_ZSWAP_NR_ZPOOLS_ORDER=32) shows
> improvements in the zsmalloc lock contention, especially on the swap out
> path.
>
> The following shows the perf analysis of the swapout path when 10
> workloads are simulatenously reclaiming and refaulting tmpfs pages.
> There are some improvements on the swapin path as well, but much less
> significant.
>
> 1 zpool:
>
> |--28.99%--zswap_frontswap_store
> | |
> <snip>
> | |
> | |--8.98%--zpool_map_handle
> | | |
> | | --8.98%--zs_zpool_map
> | | |
> | | --8.95%--zs_map_object
> | | |
> | | --8.38%--_raw_spin_lock
> | | |
> | | --7.39%--queued_spin_lock_slowpath
> | |
> | |--8.82%--zpool_malloc
> | | |
> | | --8.82%--zs_zpool_malloc
> | | |
> | | --8.80%--zs_malloc
> | | |
> | | |--7.21%--_raw_spin_lock
> | | | |
> | | | --6.81%--queued_spin_lock_slowpath
> <snip>
>
> 32 zpools:
>
> |--16.73%--zswap_frontswap_store
> | |
> <snip>
> | |
> | |--1.81%--zpool_malloc
> | | |
> | | --1.81%--zs_zpool_malloc
> | | |
> | | --1.79%--zs_malloc
> | | |
> | | --0.73%--obj_malloc
> | |
> | |--1.06%--zswap_update_total_size
> | |
> | |--0.59%--zpool_map_handle
> | | |
> | | --0.59%--zs_zpool_map
> | | |
> | | --0.57%--zs_map_object
> | | |
> | | --0.51%--_raw_spin_lock
> <snip>
>
> Suggested-by: Yu Zhao <yuzhao@xxxxxxxxxx>
> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> ---
> mm/Kconfig | 12 +++++++
> mm/zswap.c | 95 ++++++++++++++++++++++++++++++++++++------------------
> 2 files changed, 76 insertions(+), 31 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 92c30879bf67..de1da56d2c07 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -59,6 +59,18 @@ config ZSWAP_EXCLUSIVE_LOADS
> The cost is that if the page was never dirtied and needs to be
> swapped out again, it will be re-compressed.
>
> +config ZSWAP_NR_ZPOOLS_ORDER
> + int "Number of zpools in zswap, as power of 2"
> + default 0
> + depends on ZSWAP
> + help
> + This options determines the number of zpools to use for zswap, it
> + will be 1 << CONFIG_ZSWAP_NR_ZPOOLS_ORDER.
> +
> + Having multiple zpools helps with concurrency and lock contention
> + on the swap in and swap out paths, but uses a little bit of extra
> + space.

This is nearly impossible for a user, let alone a distribution, to
answer adequately.

The optimal value needs to be found empirically. And it varies heavily
not just by workload but by implementation changes. If we make changes
to the lock holding time or redo the data structures, a previously
chosen value might no longer be a net positive, and even be harmful.

Architecturally, the pool scoping and the interaction with zswap_tree
is currently a mess. We're aware of lifetime bugs, where swapoff kills
the tree but the pool still exists with entries making dead references
e.g. We likely need to rearchitect this in the near future - maybe tie
pools to trees to begin with.

(I'm assuming you're already using multiple swap files to avoid tree
lock contention, because it has the same scope as the pool otherwise.)

Such changes quickly invalidate any settings the user or distro might
make here. Exposing the implementation detail of the pools might even
get in the way of fixing bugs and cleaning up the architecture.

> @@ -263,11 +266,13 @@ static void zswap_update_total_size(void)
> {
> struct zswap_pool *pool;
> u64 total = 0;
> + int i;
>
> rcu_read_lock();
>
> list_for_each_entry_rcu(pool, &zswap_pools, list)
> - total += zpool_get_total_size(pool->zpool);
> + for (i = 0; i < zswap_nr_zpools; i++)
> + total += zpool_get_total_size(pool->zpools[i]);

This adds a O(nr_pools) operation to every load and store. It's easy
for this to dominate or outweigh locking costs as workload concurrency
changes, or data structures and locking change inside the kernel.

> @@ -587,14 +603,17 @@ static void shrink_worker(struct work_struct *w)
> {
> struct zswap_pool *pool = container_of(w, typeof(*pool),
> shrink_work);
> + int i;
>
> - if (zpool_shrink(pool->zpool, 1, NULL))
> - zswap_reject_reclaim_fail++;
> + for (i = 0; i < zswap_nr_zpools; i++)
> + if (zpool_shrink(pool->zpools[i], 1, NULL))
> + zswap_reject_reclaim_fail++;
> zswap_pool_put(pool);

This scales reclaim batch size by the number of zpools, which can lead
to varying degrees of overreclaim especially on small zswap sizes with
high pool concurrency.

I don't think this patch is ready for primetime. A user build time
setting is not appropriate for an optimization that is heavily tied to
implementation details and workload dynamics.