Re: [PATCH] mm: zswap: multiple zpool support

From: Yosry Ahmed
Date: Tue Jun 13 2023 - 16:15:33 EST


On Mon, Jun 5, 2023 at 6:56 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Fri, Jun 2, 2023 at 1:24 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Fri, Jun 02, 2023 at 12:14:28PM -0700, Yosry Ahmed wrote:
> > > On Fri, Jun 2, 2023 at 11:34 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Jun 02, 2023 at 09:59:20AM -0700, Yosry Ahmed wrote:
> > > > > On Fri, Jun 2, 2023 at 9:49 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > > > > Again, what about the zswap_tree.lock and swap_info_struct.lock?
> > > > > > They're the same scope unless you use multiple swap files. Would it
> > > > > > make sense to tie pools to trees, so that using multiple swapfiles for
> > > > > > concurrency purposes also implies this optimization?
> > > > >
> > > > > Yeah, using multiple swapfiles helps with those locks, but it doesn't
> > > > > help with the zpool lock.
> > > > >
> > > > > I am reluctant to take this path because I am trying to get rid of
> > > > > zswap's dependency on swapfiles to begin with, and have it act as its
> > > > > own standalone swapping backend. If I am successful, then having one
> > > > > zpool per zswap_tree is just a temporary fix.
> > > >
> > > > What about making the pools per-cpu?
> > > >
> > > > This would scale nicely with the machine size. And we commonly deal
> > > > with for_each_cpu() loops and per-cpu data structures, so have good
> > > > developer intuition about what's reasonable to squeeze into those.
> > > >
> > > > It would eliminate the lock contention, for everybody, right away, and
> > > > without asking questions.
> > > >
> > > > It would open the door to all kinds of locking optimizations on top.
> > >
> > > The page can get swapped out on one cpu and swapped in on another, no?
> > >
> > > We will need to record in its zswap entry which zpool the page is
> > > stored in, and potentially grab percpu locks from other cpus in the
> > > swapin path. The lock contention would probably be lower, but
> > > certainly not eliminated.
> > >
> > > Did I misunderstand?
> >
> > Sorry, I should have been more precise.
> >
> > I'm saying that using NR_CPUS pools, and replacing the hash with
> > smp_processor_id(), would accomplish your goal of pool concurrency.
> > But it would do so with a broadly-used, well-understood scaling
> > factor. We might not need a config option at all.
> >
> > The lock would still be there, but contention would be reduced fairly
> > optimally (barring preemption) for store concurrency at least. Not
> > fully eliminated due to frees and compaction, though, yes.
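
To make that concrete, a rough sketch of the per-CPU selection (purely
hypothetical names; it assumes struct zswap_pool carries a zpools[]
array with one entry per possible CPU):

	static struct zpool *zswap_cpu_zpool(struct zswap_pool *pool)
	{
		/*
		 * The local CPU is only a heuristic for spreading lock
		 * contention, so a stale CPU id after preemption is
		 * harmless; raw_smp_processor_id() avoids the
		 * preemption warning of smp_processor_id().
		 */
		return pool->zpools[raw_smp_processor_id()];
	}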

I thought about this again and had some internal discussions, and I am
more unsure about it. Beyond the memory overhead, having too many
zpools might result in higher fragmentation within the zpools. For
zsmalloc, for example, we do not compact across multiple zpools.

We have been using a specific number of zpools in our production for
years now, and we know it can be tuned to achieve performance gains.
OTOH, percpu zpools (or NR_CPUS pools) seem like too big of a hammer:
probably too many zpools in a lot of cases, and we wouldn't know how
many zpools actually fit our workloads.

I see value in allowing the number of zpools to be directly
configurable (it can always be left as 1), and am worried that with
percpu we would be throwing away years of production testing for an
unknown.

I am obviously biased, but I don't think this adds significant
complexity to the zswap code as-is (or as v2 is, to be precise).

WDYT?

>
> Yeah, I think we can do that. I looked at the size of the zsmalloc pool
> as an example, and it seems to be less than 1K, so having one per cpu
> seems okay.
>
> There are a few things that we will need to do:
> - Rework zswap_update_total_size(). We don't want to loop through all
> cpus on each load/store. We can be smarter about it and inc/dec the
> total zswap pool size each time we allocate or free a page in the
> driver. This might need some plumbing from the drivers to zswap (or
> passing a callback from zswap to the drivers).
>
> - Update zsmalloc such that all pools share kmem caches, instead of
> creating two kmem caches per zsmalloc pool. This was a follow-up I
> had in mind for multiple zpools support anyway, but I guess it's more
> significant if we have NR_CPUS pools.
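
For the zswap_update_total_size() rework in the first item above,
something along these lines could work (a minimal, hypothetical sketch;
it assumes the backends can notify zswap whenever they allocate or free
backing pages):

	static atomic_long_t zswap_total_pages = ATOMIC_LONG_INIT(0);

	/*
	 * Called from the zpool backends (or via a callback passed down
	 * from zswap) on every backing-page allocation and free, so the
	 * total is maintained incrementally instead of being recomputed
	 * by looping over all pools/CPUs on every load/store.
	 */
	static void zswap_account_pages(long nr_pages)
	{
		atomic_long_add(nr_pages, &zswap_total_pages);
	}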
>
> I was nervous about increasing the size of struct zswap_entry to store
> the cpu/zpool where the entry resides, but I realized we can replace
> the pointer to zswap_pool in struct zswap_entry with a pointer to
> zpool, and add a zswap_pool pointer in struct zpool. This will
> actually trim down the common "entry->pool->zpool" to just
> "entry->zpool", and then we can replace any "entry->pool" with
> "entry->zpool->pool".
>
> @Yu Zhao, any thoughts on this? The multiple zpools support was
> initially your idea (and you did the initial implementation) -- so your
> input is very valuable here.
>
> >
> > I'm not proposing more than that at this point. I only wrote the last
> > line because already having per-cpu data structures might help with
> > fast path optimizations down the line, if contention is still an
> > issue. But unlikely. So it's not so important. Let's forget it.