Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization

From: Huang, Ying
Date: Tue Mar 26 2024 - 22:54:32 EST


Hi, Kairui,

Kairui Song <ryncsn@xxxxxxxxx> writes:

> From: Kairui Song <kasong@xxxxxxxxxxx>
>
> A month ago a bug was fixed for SWP_SYNCHRONOUS_IO swapin (swap cache
> bypass swapin):
> https://lore.kernel.org/linux-mm/20240219082040.7495-1-ryncsn@xxxxxxxxx/
>
> Because we have to spin on the swap map on race, and swap map is too small
> to contain more usable info, an ugly schedule_timeout_uninterruptible(1)
> is added. It's not the first time a hackish workaround was added for cache
> bypass swapin and not the last time. I did many experiments locally to
> see if the swap cache bypass path can be dropped while keeping the
> performance still comparable. And it seems doable.
>

In general, I think that it's a good idea to unify cache bypass swapin
and normal swapin. But I haven't dived into the implementation yet.

> This series does the following things:
> 1. Remove swap cache bypass completely.
> 2. Apply multiple optimizations after that, these optimizations are
> either undoable or very difficult to do without dropping the cache
> bypass swapin path.
> 3. Use swap cache as a synchronization layer, also unify some code
> with page cache (filemap).
>
> As a result, we have:
> 1. A comparable performance, some tests are even faster.
> 2. Multi-index support for swap cache.
> 3. Many hackish workarounds are removed, and the long-tail issue
> above is gone.
>
> Sending this as RFC to collect some discussion, suggestions, or rejection
> early. This seems to need to be split into multiple series, but the
> performance is not good until the last patch, so I think starting by
> separating them may make this approach not very convincing. And there
> are still some (maybe further) TODO items and optimization space
> if we are OK with this approach.
>
> This is based on another series of mine, for reusing filemap code for swapcache:
> [PATCH v2 0/4] mm/filemap: optimize folio adding and splitting
> https://lore.kernel.org/linux-mm/20240325171405.99971-1-ryncsn@xxxxxxxxx/
>
> Patch 1/10, introduce a helper from filemap side to be used later.
> Patch 2/10, 3/10 are clean up and prepare for removing the swap cache
> bypass swapin path.
> Patch 4/10, remove the swap cache bypass swapin path, and the
> performance drops heavily (-28%).
> Patch 5/10, apply the first optimization after the removal: since all
> folios go through the swap cache now, there is no need for explicit
> shadow clearing any more.
> Patch 6/10, apply another optimization after cleaning up the shadow
> clearing routines. Now swapcache is very much like page cache, so just
> reuse page cache code and we will have multi-index support. Shadow
> memory usage dropped a lot.
> Patch 7/10, just rename __read_swap_cache_async. It will be refactored
> and become a key part of this series, and the naming is very confusing
> to me.
> Patch 8/10, make swap cache as a synchronization layer, introduce two
> helpers for adding folios to swap cache, caller will either succeed or
> get a folio to wait on.
> Patch 9/10, apply another optimization. With above two helpers, looking
> up of swapcache can be optimized and avoid false looking up, which
> helped improve the performance.
> Patch 10/10, apply a major optimization for SWP_SYNCHRONOUS_IO devices,
> after this commit, performance for simple swapin/swapout is basically
> same as before.
>
> Test 1, sequential swapin/out of 30G zero page on ZRAM:
>
>                   Before (us)   After (us)
> Swapout:          33619409      33886008
> Swapin:           32393771      32465441 (- 0.2%)
> Swapout (THP):    7817909       6899938  (+11.8%)
> Swapin (THP):     32452387      33193479 (- 2.2%)

If my understanding is correct, we don't have swapin (THP) support
yet. Right?

> And after swapping out 30G with THP, the radix node usage dropped by a
> lot:
>
> Before: radix_tree_node 73728K
> After: radix_tree_node 7056K (-94%)

Good!

> Test 2:
> Mysql (16g buffer pool, 32G ZRAM SWAP, 4G memcg, Zswap disabled, THP never)
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> --threads=48 --time=300 --report-interval=10 run
>
> Before: transactions: 4849.25 per sec
> After: transactions: 4849.40 per sec
>
> Test 3:
> Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP never)
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo 100 > /sys/module/zswap/parameters/max_pool_percent
> echo 1 > /sys/module/zswap/parameters/enabled
> echo y > /sys/module/zswap/parameters/shrinker_enabled
>
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> --threads=48 --time=600 --report-interval=10 run
>
> Before: transactions: 1662.90 per sec
> After: transactions: 1726.52 per sec

3.8% improvement. Good!
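
For reference, the figure follows directly from the two throughput
numbers quoted above. A quick check, as a minimal sketch (Python is used
only for convenience, and the helper name pct_change is made up for
illustration):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Test 3 transactions per second, taken from the quoted cover letter.
print(f"{pct_change(1662.90, 1726.52):+.1f}%")
```

The same helper can be applied to the other before/after pairs in the
cover letter; a positive result means higher throughput after the series.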

> Test 4:
> Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP always)
> echo always > /sys/kernel/mm/transparent_hugepage/enabled
> echo 100 > /sys/module/zswap/parameters/max_pool_percent
> echo 1 > /sys/module/zswap/parameters/enabled
> echo y > /sys/module/zswap/parameters/shrinker_enabled
>
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> --threads=48 --time=600 --report-interval=10 run
>
> Before: transactions: 2860.90 per sec.
> After: transactions: 2802.55 per sec.
>
> Test 5:
> Memtier / memcached (16G brd SWAP, 8G memcg, THP never):
>
> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
>
> memtier_benchmark -S /tmp/memcached.socket \
> -P memcache_binary -n allkeys --key-minimum=1 \
> --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
> --ratio 1:0 --pipeline 8 -d 1000
>
> Before: 106730.31 Ops/sec
> After: 106360.11 Ops/sec
>
> Test 6:
> Memtier / memcached (16G brd SWAP, 8G memcg, THP always):
>
> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
>
> memtier_benchmark -S /tmp/memcached.socket \
> -P memcache_binary -n allkeys --key-minimum=1 \
> --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
> --ratio 1:0 --pipeline 8 -d 1000
>
> Before: 83193.11 Ops/sec
> After: 82504.89 Ops/sec
>
> These tests are run under heavy memory stress, and the performance
> seems basically the same as before, very slightly better/worse for
> certain cases. The benefits of multi-index are basically erased by
> fragmentation, and workingset nodes usage is slightly lower.
>
> Some (maybe further) TODO items if we are OK with this approach:
>
> - I see a slight performance regression for THP tests,
> could identify a clear hotspot with perf, my guess is the
> contention on the xa_lock is an issue (we have an xa_lock for
> every 64M swap cache space), THP handling needs to take the lock
> longer than usual. Splitting the xa_lock to be more
> fine-grained seems a good solution. We have
> SWAP_ADDRESS_SPACE_SHIFT = 14 which is not an optimal value.
> Considering XA_CHUNK_SHIFT is 6, we will have three layers of XArray
> just for 2 extra bits. 12 should be better to always make use of
> the whole XA chunk and have two layers at most. But a duplicated
> address_space struct also wastes more memory and cachelines.
> I see an observable performance drop (~3%) after changing
> SWAP_ADDRESS_SPACE_SHIFT to 12. Might be a good idea to
> decouple swap cache xarray from address_space (there are
> too many users of swapcache, shouldn't come too dirty).
>
> - Actually after Patch 4/10, the performance is much better for
> tests limited with memory cgroup, until 10/10 applied the direct swap
> cache freeing logic for SWP_SYNCHRONOUS_IO swapin. Because if the swap
> device is not near full, swapin doesn't clear up the swapcache, so
> repeated swapout doesn't need to re-alloc a swap entry, which makes
> things faster. This may indicate that lazy freeing of swap cache could
> benefit certain workloads and may be worth looking into later.
>
> - Now SWP_SYNCHRONOUS_IO swapin will bypass readahead and force drop
> swap cache after swapin is done, which can be cleaned up and optimized
> further after this patch. Device type will only determine the
> readahead logic, and swap cache drop check can be based purely on swap
> count.
>
> - Recent mTHP swapin/swapout series should have no fundamental
> conflict with this.
>
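
The SWAP_ADDRESS_SPACE_SHIFT arithmetic in the first TODO item can be
checked with a simplified model. This is only a sketch under stated
assumptions: 4 KiB base pages, and each XArray level resolving exactly
XA_CHUNK_SHIFT bits of the index. The helper names are hypothetical, and
the real tree shape with multi-index entries may differ.

```python
import math

XA_CHUNK_SHIFT = 6   # from the cover letter: index bits resolved per XArray node
PAGE_SHIFT = 12      # assumption: 4 KiB base pages

def xarray_levels(index_bits: int, chunk_shift: int = XA_CHUNK_SHIFT) -> int:
    """Node levels needed to cover `index_bits` bits of index (simplified)."""
    return math.ceil(index_bits / chunk_shift)

def space_bytes(address_space_shift: int, page_shift: int = PAGE_SHIFT) -> int:
    """Bytes of swap covered by one address_space, and so by one xa_lock."""
    return 1 << (address_space_shift + page_shift)

for shift in (14, 12):
    mib = space_bytes(shift) >> 20
    print(f"shift={shift}: {xarray_levels(shift)} levels, {mib} MiB per xa_lock")
```

Under these assumptions, a shift of 14 needs three levels and covers 64
MiB per lock, which matches the numbers in the cover letter, while a
shift of 12 needs two levels and covers 16 MiB per lock.
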
> Kairui Song (10):
> mm/filemap: split filemap storing logic into a standalone helper
> mm/swap: move no readahead swapin code to a stand-alone helper
> mm/swap: convert swapin_readahead to return a folio
> mm/swap: remove cache bypass swapin
> mm/swap: clean shadow only in unmap path
> mm/swap: switch to use multi index entries
> mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get
> mm/swap: use swap cache as a synchronization layer
> mm/swap: delay the swap cache look up for swapin
> mm/swap: optimize synchronous swapin
>
> include/linux/swapops.h | 5 +-
> mm/filemap.c | 161 +++++++++-----
> mm/huge_memory.c | 78 +++----
> mm/internal.h | 2 +
> mm/memory.c | 133 ++++-------
> mm/shmem.c | 44 ++--
> mm/swap.h | 71 ++++--
> mm/swap_state.c | 478 +++++++++++++++++++++-------------------
> mm/swapfile.c | 64 +++---
> mm/vmscan.c | 8 +-
> mm/workingset.c | 2 +-
> mm/zswap.c | 4 +-
> 12 files changed, 540 insertions(+), 510 deletions(-)

--
Best Regards,
Huang, Ying