Re: [PATCH 0/2] RFC: zswap tree use xarray instead of RB tree

From: Chris Li
Date: Fri Jan 19 2024 - 06:59:30 EST


On Fri, Jan 19, 2024 at 3:12 AM Chengming Zhou
<zhouchengming@xxxxxxxxxxxxx> wrote:
>
> On 2024/1/19 18:26, Chris Li wrote:
> > On Thu, Jan 18, 2024 at 10:19 PM Chengming Zhou
> > <zhouchengming@xxxxxxxxxxxxx> wrote:
> >>
> >> On 2024/1/19 12:59, Chris Li wrote:
> >>> On Wed, Jan 17, 2024 at 11:35 PM Chengming Zhou
> >>> <zhouchengming@xxxxxxxxxxxxx> wrote:
> >>>
> >>>>>>>       mm-stable    zswap-split-tree  zswap-xarray
> >>>>>>> real  1m10.442s    1m4.157s          1m9.962s
> >>>>>>> user  17m48.232s   17m41.477s        17m45.887s
> >>>>>>> sys   8m13.517s    5m2.226s          7m59.305s
> >>>>>>>
> >>>>>>> Looks like the concurrency contention is still there; I haven't
> >>>>>>> looked into the code yet, will review it later.
> >>>>>
> >>>>> Thanks for the quick test. Interesting to see the sys usage drop for
> >>>>> the xarray case even with the spin lock.
> >>>>> Not sure if the 13-second saving is statistically significant or not.
> >>>>>
> >>>>> We might need to have both the xarray and split trees for zswap. It is
> >>>>> likely that removing the spin lock alone wouldn't make up the 35%
> >>>>> difference. That is just my guess; there is only one way to find out.
> >>>>
> >>>> Yes, I totally agree with this! IMHO, concurrent zswap_store paths still
> >>>> have to contend for the xarray spinlock even after we convert the
> >>>> rb-tree to the xarray structure. So I think we should have both.
> >>>>
> >>>>>
> >>>>> BTW, do you have a script I can run to replicate your results?
> >>>
> >>> Hi Chengming,
> >>>
> >>> Thanks for your script.
> >>>
> >>>>
> >>>> ```
> >>>> #!/bin/bash
> >>>>
> >>>> testname="build-kernel-tmpfs"
> >>>> cgroup="/sys/fs/cgroup/$testname"
> >>>>
> >>>> tmpdir="/tmp/vm-scalability-tmp"
> >>>> workdir="$tmpdir/$testname"
> >>>>
> >>>> memory_max="$((2 * 1024 * 1024 * 1024))"
> >>>>
> >>>> linux_src="/root/zcm/linux-6.6.tar.xz"
> >>>> NR_TASK=32
> >>>>
> >>>> swapon ~/zcm/swapfile
> >>>
> >>> How big is your swapfile here?
> >>
> >> The swapfile is big enough here; I use a 50GB swapfile.
> >
> > Thanks,
> >
> >>
> >>>
> >>> It seems you have only one swapfile there. That can explain the contention.
> >>> Have you tried multiple swapfiles for the same test?
> >>> That should reduce the contention without using your patch.
> >> Do you mean to have many 64MB swapfiles to swapon at the same time?
> >
> > 64MB is too small. There is a limit, MAX_SWAPFILES, on the number of
> > swap files; it works out to (32 - n), where n is a few reserved entries.
> > If you want 50GB of swap space, you can have MAX_SWAPFILES swapfiles,
> > each of size 50GB / MAX_SWAPFILES.
>
> Right.
>
> >
> >> Maybe it's feasible to test,
> >
> > Of course it is testable; I am curious to see the test results.
> >
> >> I'm not sure how swapout will choose.
> >
> > It will rotate through swap files of the same priority first; see
> > get_swap_pages() in swapfile.c.
> >
> >> But in our usecase, we normally have only one swapfile.
> >
> > Is there a good reason why you can't use more than one swapfile?
>
> I think not, but it seems an unneeded change/burden for our admins.
> So I just tested and optimized for the normal case.

I understand. Just saying it is not really a kernel limitation per se.
I blame the user space :-)
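
For what it's worth, here is a minimal sketch of what that user-space
setup could look like (the paths, sizes, and file count below are made
up for illustration):

```
#!/bin/bash
# Sketch: split one big swapfile into several smaller ones, all at the
# same priority, so get_swap_pages() rotates across them instead of
# serializing on a single (struct swap_info_struct).lock.
# Paths and sizes are hypothetical.
NR_SWAPFILES=8                        # must stay below MAX_SWAPFILES
SIZE_MB=$((50 * 1024 / NR_SWAPFILES))

for i in $(seq 1 "$NR_SWAPFILES"); do
    f="/swap/swapfile-$i"
    dd if=/dev/zero of="$f" bs=1M count="$SIZE_MB" status=none
    chmod 600 "$f"
    mkswap "$f" > /dev/null
    swapon -p 0 "$f"                  # equal priority => round-robin use
done
```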

>
> > One swapfile will not take full advantage of the existing code, even
> > if you split the zswap trees within a swapfile. With only one
> > swapfile, you will still have lock contention on "(struct
> > swap_info_struct).lock"; there is one such lock per swapfile.
> > Using more than one swap file should get you better results.
>
> IIUC, we already have the per-cpu swap entry cache to avoid contending
> for this lock? And I don't see this lock showing up as hot in the testing.

Yes, the swap entry cache helps. But the cache batching also causes
other problems, e.g. the long tail in swap fault handling.
Shameless plug: I posted a patch earlier to address the swap fault
long-tail latencies.

https://lore.kernel.org/linux-mm/20231221-async-free-v1-1-94b277992cb0@xxxxxxxxxx/T/
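
On the si->lock hotness question above: one way to double-check is the
perf lock contention tooling, assuming a recent kernel (v5.19+, for the
lock:contention_begin/end tracepoints) and a perf built with BPF support:

```
# Sketch: measure kernel lock contention while the build test runs.
# Output format varies across perf/kernel versions.
perf lock contention -a -b -- sleep 10   # -a: all CPUs, -b: use BPF
# Entries attributable to get_swap_pages()/si->lock would show whether
# the per-swapfile lock is actually contended in this workload.
```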

Chris