Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call

From: Yosry Ahmed
Date: Wed Jan 17 2024 - 14:31:37 EST


On Wed, Jan 17, 2024 at 11:14 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero
> <debug.penguin32@xxxxxxxxx> wrote:
>
> + Johannes and Yosry
>
> >
> > The core-api create_workqueue() is deprecated; this patch replaces
> > it with alloc_workqueue(). zswap's previous workqueue was a bound
> > workqueue; this patch uses alloc_workqueue() to create an unbound
> > one instead. The WQ_UNBOUND attribute is desirable because the
> > workqueue is then not localized to a specific cpu, and the
> > scheduler is free to run its work items on whichever cpu is best
> > placed when demand on the local cpu is high.
>
> nit: an extra blank line between paragraphs would be nice.
>
> > For example, other workqueues bound to the same cpu, such as
> > WQ_HIGHPRI and WQ_CPU_INTENSIVE ones, may need to be served first.
> > An unbound workqueue also happens to be more efficient than a
> > bound one when the system is under memory pressure.
> >
> > shrink_wq = alloc_workqueue("zswap-shrink",
> > WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> >
> > Overall the change suggested in this patch should be seamless and
> > does not alter the existing behavior, other than making the
> > workqueue unbound.
> >
> > Signed-off-by: Ronald Monthero <debug.penguin32@xxxxxxxxx>
> > ---
> > mm/zswap.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 74411dfdad92..64dbe3e944a2 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> > zswap_enabled = false;
> > }
> >
> > - shrink_wq = create_workqueue("zswap-shrink");
> > + shrink_wq = alloc_workqueue("zswap-shrink",
> > + WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
>
> Have you benchmarked this to check if there is any regression, just to
> be safe? With an unbounded workqueue, you're gaining scheduling
> flexibility at the cost of cache locality. My intuition is that it
> doesn't matter too much here, but you should probably double check by
> stress testing - run some workload with a relatively small zswap pool
> limit (i.e., heavy global writeback), and see if there is any difference
> in performance.

I also think this shouldn't make a large difference. The global
shrinking work is already expensive, and I imagine that it exhausts
the caches anyway by iterating memcgs. A performance smoketest would
be reassuring for sure, but I believe it won't make a difference.
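
(For such a smoketest, dropping the pool limit via the
max_pool_percent module parameter, i.e.
/sys/module/zswap/parameters/max_pool_percent, should be enough to
keep the pool at the limit and force heavy global writeback.)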

Keep in mind that even with WQ_UNBOUND, we prefer the local CPU (see
wq_select_unbound_cpu()), so it will take more than global writeback
to observe a difference. The local CPU must not be in
wq_unbound_cpumask, or CONFIG_DEBUG_WQ_FORCE_RR_CPU should be on.
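
For reference, create_workqueue() is nowadays just a legacy wrapper
(from include/linux/workqueue.h):

#define create_workqueue(name)					\
	alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))

so the old workqueue was already max_active = 1 with WQ_MEM_RECLAIM,
and the functional change here boils down to dropping __WQ_LEGACY and
adding WQ_UNBOUND.

The local-CPU preference looks roughly like this (a simplified sketch
of wq_select_unbound_cpu() in kernel/workqueue.c, modulo version
drift):

static int wq_select_unbound_cpu(int cpu)
{
	int new_cpu;

	if (likely(!wq_debug_force_rr_cpu)) {
		/* prefer the local CPU if it is allowed for unbound work */
		if (cpumask_test_cpu(cpu, wq_unbound_cpumask))
			return cpu;
	}

	/* otherwise (or with DEBUG_WQ_FORCE_RR_CPU) round-robin */
	new_cpu = __this_cpu_read(wq_rr_cpu_last);
	new_cpu = cpumask_next_and(new_cpu, wq_unbound_cpumask,
				   cpu_online_mask);
	if (unlikely(new_cpu >= nr_cpu_ids)) {
		new_cpu = cpumask_first_and(wq_unbound_cpumask,
					    cpu_online_mask);
		if (unlikely(new_cpu >= nr_cpu_ids))
			return cpu;
	}
	__this_cpu_write(wq_rr_cpu_last, new_cpu);

	return new_cpu;
}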

>
> > if (!shrink_wq)
> > goto fallback_fail;
> >
> > --
> > 2.34.1
> >
>
> On a different note, I wonder if it would help to perform synchronous
> reclaim here instead. With our current design, a zswap store failure
> (due to the global limit being hit) sends the incoming page to swap
> instead, creating an LRU inversion. Not sure if that's ideal.

The global shrink path keeps reclaiming until zswap can accept again
(by default, that means reclaiming 10% of the total limit). I think
this is too expensive to be done synchronously.
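
For the math: zswap_can_accept() compares the pool size against
accept_thr_percent (default 90) of the max pool size, roughly (from
mm/zswap.c, modulo version drift):

static bool zswap_can_accept(void)
{
	return zswap_accept_thr_percent * zswap_max_pool_percent *
		totalram_pages() / 100 / 100 >
			DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);
}

So once the limit is hit, the shrink worker has to write back ~10% of
the entire pool before stores are accepted again, which is way too
much work to do in the store path.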