Re: [PATCH 0/2] mm/zswap: optimize the scalability of zswap rb-tree

From: Johannes Weiner
Date: Thu Jan 18 2024 - 13:07:11 EST


On Thu, Jan 18, 2024 at 09:30:12AM -0800, Yosry Ahmed wrote:
> On Thu, Jan 18, 2024 at 7:34 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Wed, Jan 17, 2024 at 10:37:22AM -0800, Yosry Ahmed wrote:
> > > On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
> > > <zhouchengming@xxxxxxxxxxxxx> wrote:
> > > >
> > > > When testing zswap performance with a kernel build (-j32) in a tmpfs
> > > > directory, I found that the scalability of the zswap rb-tree, which is
> > > > protected by a single spinlock, is not good. That causes heavy lock
> > > > contention when multiple tasks zswap_store/load concurrently.
> > > >
> > > > So a simple solution is to split the single zswap rb-tree into multiple
> > > > rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M) of swap
> > > > space. This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap
> > > > cache into 64MB trunks").
> > > >
> > > > Although this method can't eliminate the spinlock contention completely,
> > > > it mitigates much of it. Below are the results of a kernel build in
> > > > tmpfs with the zswap shrinker enabled:
> > > >
> > > >          linux-next    zswap-lock-optimize
> > > > real     1m9.181s      1m3.820s
> > > > user     17m44.036s    17m40.100s
> > > > sys      7m37.297s     4m54.622s
> > > >
> > > > So there are clear improvements. This work is also complementary to the
> > > > ongoing zswap xarray conversion by Chris, so IMHO we can merge this
> > > > first. I just refreshed and resent it for further discussion.
> > >
> > > The reason I think we should wait for the xarray patch(es) is that
> > > there is a chance we may see smaller improvements from splitting the
> > > tree once it is an xarray. If we merge this series first, there is no
> > > way to know.
> >
> > I mentioned this before, but I disagree quite strongly with this
> > general sentiment.
> >
> > Chengming's patches are simple, mature, and have convincing
> > numbers. IMO it's poor form to hold something like that for "let's see
> > how our other experiment works out". The only exception would be if we
> > all agree that the earlier change flies in the face of the overall
> > direction we want to pursue, which I don't think is the case here.
>
> My intention was not to delay merging these patches until the xarray
> patches are merged. It was only to wait until the xarray patches are
> *posted*, so that we can redo the testing on top of them and verify
> that the gains are still there. That should have been around now, but
> the xarray patches were posted in a form that does not allow this
> testing (because we still have a lock on the read path), so I am less
> inclined to wait.
>
> My rationale was that if the gains from splitting the tree become
> minimal after we switch to an xarray, we won't know. It's more
> difficult to remove optimizations than to add them, because removal
> may cause a regression. I am kind of paranoid about having code
> sitting around without full information about how much it's actually
> needed.

Yeah I understand that fear.

I expect the splitting to help more than the move to xarray because
it's the writes that are hot. Luckily in this case it should be fairly
easy to differential-test after it's been merged by changing that tree
lookup macro/function locally to always return &trees[type][0], right?
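
For illustration, here is a minimal sketch of what that local change
could look like; the helper name swap_zswap_tree() and the
zswap_trees[][] layout are assumptions based on this thread, not
necessarily the exact form in Chengming's series:

/* Per-range tree lookup as added by the split-tree series (illustrative). */
static struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
{
        return &zswap_trees[swp_type(swp)]
                           [swp_offset(swp) >> SWAP_ADDRESS_SPACE_SHIFT];

        /*
         * Differential test: collapse everything back onto a single tree
         * (pre-split behaviour) by returning the first tree instead:
         *
         *      return &zswap_trees[swp_type(swp)][0];
         */
}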

> In this case, I suppose we can redo the testing (1 tree vs. split
> trees) once the xarray patches are in a testable form, and before we
> have formed any strong dependencies on the split trees (we have time
> until v6.9 is released, I assume).
>
> How about that?

That sounds reasonable.

> > With the xarray we'll still have a per-swapfile lock for writes. That
> > lock is the reason SWAP_ADDRESS_SPACE segmentation was introduced for
> > the swapcache in the first place. Lockless reads help of course, but
> > read-only accesses to swap are in the minority - stores will write, and
> > loads are commonly followed by invalidations. Somebody already went
> > through the trouble of proving that xarrays + segmentation are worth
> > it for swap load and store access patterns. Why dismiss that?
>
> Fair point, although I think the swapcache lock may be more contended
> than the zswap tree lock.

Right, the swapcache has two updates for each transition, compared to
one for zswap. But we know that in a concurrent system under pressure,
a globally shared swap lock will hurt. There is a history of that in
Chengming's numbers, your previous patch to split the zpools, commit
235b62176712b970c815923e36b9a9cc05d4d901, etc.
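
For reference, the swapcache segmentation mentioned above looks roughly
like this (paraphrased from memory; see include/linux/swap.h and
mm/swap_state.c for the authoritative version): each 64M range of a
swapfile gets its own address_space, and thus its own tree and lock:

#define SWAP_ADDRESS_SPACE_SHIFT        14
#define SWAP_ADDRESS_SPACE_PAGES        (1 << SWAP_ADDRESS_SPACE_SHIFT)

/* One array of address_spaces per swapfile, one entry per 64M range. */
extern struct address_space *swapper_spaces[];

#define swap_address_space(entry)                           \
        (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
                >> SWAP_ADDRESS_SPACE_SHIFT])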

> > So my vote is that we follow the usual upstreaming process here:
> > merge the ready patches now, and rebase future work on top of them.
>
> No objections given the current state of the xarray patches, as I
> mentioned earlier, but I would prefer that we redo the testing with
> the xarray once that is possible.

Cool, sounds good to me.