[RFC PATCH 00/10] mm/swap: always use swap cache for synchronization

From: Kairui Song
Date: Tue Mar 26 2024 - 15:04:24 EST


From: Kairui Song <kasong@xxxxxxxxxxx>

A month ago a bug was fixed for SWP_SYNCHRONOUS_IO swapin (swap cache
bypass swapin):
https://lore.kernel.org/linux-mm/20240219082040.7495-1-ryncsn@xxxxxxxxx/

Because racing threads have to spin on the swap map, and the swap map is
too small to hold any more useful information, an ugly
schedule_timeout_uninterruptible(1) was added. It was not the first hackish
workaround added for cache bypass swapin, and it will not be the last. I
ran many experiments locally to see whether the swap cache bypass path can
be dropped while keeping performance comparable, and it seems doable.
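
To make the problem concrete, here is a toy user-space model in Python (not
kernel code) of why a bare "busy" bit forces a timed back-off. All names
(SwapMap, try_pin, unpin, swapin_bypass) are hypothetical, and the in-place
loop is a simplification of the real retry path. The point is only that the
loser of the race learns nothing except "busy", so it has nothing to sleep
on and can only poll.

```python
import threading
import time


class SwapMap:
    """Toy stand-in for the swap map: one 'busy' bit per swap entry."""

    def __init__(self):
        self._busy = set()
        self._lock = threading.Lock()

    def try_pin(self, entry):
        """Set the busy bit; return False if it is already set."""
        with self._lock:
            if entry in self._busy:
                return False
            self._busy.add(entry)
            return True

    def unpin(self, entry):
        """Clear the busy bit."""
        with self._lock:
            self._busy.discard(entry)


def swapin_bypass(swap_map, entry, read_page, backoff=0.001, sleep=time.sleep):
    """Poll until the entry can be pinned, sleeping between attempts.

    Returns (data, retries). The bit carries no waiter information, so
    the loser cannot be woken up; it can only sleep and try again.
    """
    retries = 0
    while not swap_map.try_pin(entry):
        retries += 1
        sleep(backoff)
    try:
        return read_page(entry), retries
    finally:
        swap_map.unpin(entry)
```

The `sleep` parameter exists only so the back-off can be replaced in a
test; a real waiter would pay at least one timer tick per retry, which is
where the long tail comes from.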

This series does the following things:
1. Remove swap cache bypass completely.
2. Apply multiple optimizations after that. These optimizations are
either impossible or very difficult to do without dropping the cache
bypass swapin path.
3. Use swap cache as a synchronization layer, and unify some code
with page cache (filemap).

As a result, we have:
1. Comparable performance; some tests are even faster.
2. Multi-index support for swap cache.
3. Many hackish workarounds removed, and the long-tail issue mentioned
above is gone.

Sending this as an RFC to collect discussion, suggestions, or rejection
early. This probably needs to be split into multiple series, but the
performance is not good until the last patch, so I think separating them
up front may make this approach look unconvincing. There are still some
(maybe further) TODO items and room for optimization if we are OK with
this approach.

This is based on another series of mine, which reuses filemap code for
swapcache:
[PATCH v2 0/4] mm/filemap: optimize folio adding and splitting
https://lore.kernel.org/linux-mm/20240325171405.99971-1-ryncsn@xxxxxxxxx/

Patch 1/10 introduces a helper on the filemap side to be used later.
Patch 2/10 and 3/10 are cleanups that prepare for removing the swap cache
bypass swapin path.
Patch 4/10 removes the swap cache bypass swapin path, and performance
drops heavily (-28%).
Patch 5/10 applies the first optimization after the removal: since all
folios go through the swap cache now, there is no need for explicit shadow
clearing any more.
Patch 6/10 applies another optimization after cleaning up the shadow
clearing routines. Now swapcache is very much like page cache, so just
reuse the page cache code and we get multi-index support. Shadow memory
usage drops a lot.
Patch 7/10 just renames __read_swap_cache_async; it will be refactored and
become a key part of this series, and the old naming is very confusing to
me.
Patch 8/10 makes the swap cache a synchronization layer and introduces two
helpers for adding folios to the swap cache; the caller will either succeed
or get a folio to wait on.
Patch 9/10 applies another optimization. With the above two helpers, swap
cache lookup can be optimized to avoid false lookups, which helps improve
performance.
Patch 10/10 applies a major optimization for SWP_SYNCHRONOUS_IO devices.
After this commit, performance for simple swapin/swapout is basically the
same as before.
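
The "either succeed or get a folio to wait on" contract from Patch 8/10 can
be sketched as a toy user-space model in Python. This is not the kernel
implementation; Folio, SwapCache, add_or_get and swapin are hypothetical
names, and a threading.Event stands in for the folio lock. It only shows
the shape of the idea: the cache slot itself is the synchronization point,
so a loser always has an object to sleep on.

```python
import threading


class Folio:
    """Toy folio: stays 'locked' until its owner has filled in the data."""

    def __init__(self, entry):
        self.entry = entry
        self.data = None
        self._ready = threading.Event()

    def unlock(self, data):
        """Publish the data and wake up every waiter."""
        self.data = data
        self._ready.set()

    def wait(self):
        """Sleep until the owner unlocks the folio."""
        self._ready.wait()


class SwapCache:
    """Toy swap cache: at most one folio per swap entry."""

    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()

    def add_or_get(self, entry):
        """Return (folio, True) if we inserted it, else (existing, False)."""
        with self._lock:
            folio = self._slots.get(entry)
            if folio is not None:
                return folio, False
            folio = Folio(entry)
            self._slots[entry] = folio
            return folio, True


def swapin(cache, entry, read_page):
    """Swap in one entry; only the thread that won the insert does I/O."""
    folio, owner = cache.add_or_get(entry)
    if owner:
        folio.unlock(read_page(entry))  # winner reads, then wakes waiters
    else:
        folio.wait()                    # loser sleeps on the folio, no polling
    return folio.data
```

With several threads calling swapin() on the same entry, read_page() runs
once and every caller returns the same data, without any timed retry.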

Test 1, sequential swapin/out of 30G zero page on ZRAM:

               Before (us)   After (us)
Swapout:       33619409      33886008
Swapin:        32393771      32465441   (- 0.2%)
Swapout (THP): 7817909       6899938    (+11.8%)
Swapin (THP):  32452387      33193479   (- 2.2%)

And after swapping out 30G with THP, the radix node usage dropped by a
lot:

Before: radix_tree_node 73728K
After:  radix_tree_node 7056K (-90%)
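
A back-of-the-envelope sketch of where the node savings come from, in
Python. The assumptions here are mine: 4K pages, order-9 THP, densely
packed and naturally aligned folios, and XA_CHUNK_SHIFT = 6 (64 slots per
node). The function names are hypothetical. With one slot per page, every
64 pages need a leaf node; a multi-index entry of order 6 or higher lives
in a higher-level node, so those leaf nodes are not needed at all.

```python
XA_CHUNK_SHIFT = 6                  # 64 slots per XArray node
XA_CHUNK_SIZE = 1 << XA_CHUNK_SHIFT
PAGE_SIZE = 4096                    # assumption: 4K pages


def leaf_nodes(total_pages, folio_order, multi_index):
    """Leaf nodes needed to index total_pages of dense, aligned folios."""
    if multi_index and folio_order >= XA_CHUNK_SHIFT:
        return 0  # each folio is a single entry in a higher-level node
    return -(-total_pages // XA_CHUNK_SIZE)  # ceiling division


def slots_in_parent(folio_order):
    """Slots one multi-index entry spans in the node that holds it."""
    return 1 << (folio_order % XA_CHUNK_SHIFT)


pages_30g = (30 << 30) // PAGE_SIZE
print(leaf_nodes(pages_30g, 9, multi_index=False))
print(leaf_nodes(pages_30g, 9, multi_index=True))
print(slots_in_parent(9))
```

This only counts leaf nodes for the swapped-out range; the slab numbers
above also include upper-level nodes and other radix tree users, so the
two are not expected to match exactly.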

Test 2:
MySQL (16G buffer pool, 32G ZRAM SWAP, 4G memcg, Zswap disabled, THP never)
sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
--mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
--threads=48 --time=300 --report-interval=10 run

Before: transactions: 4849.25 per sec
After: transactions: 4849.40 per sec

Test 3:
MySQL (16G buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP never)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo 100 > /sys/module/zswap/parameters/max_pool_percent
echo 1 > /sys/module/zswap/parameters/enabled
echo y > /sys/module/zswap/parameters/shrinker_enabled

sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
--mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
--threads=48 --time=600 --report-interval=10 run

Before: transactions: 1662.90 per sec
After: transactions: 1726.52 per sec

Test 4:
MySQL (16G buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP always)
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo 100 > /sys/module/zswap/parameters/max_pool_percent
echo 1 > /sys/module/zswap/parameters/enabled
echo y > /sys/module/zswap/parameters/shrinker_enabled

sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
--mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
--threads=48 --time=600 --report-interval=10 run

Before: transactions: 2860.90 per sec.
After: transactions: 2802.55 per sec.

Test 5:
Memtier / memcached (16G brd SWAP, 8G memcg, THP never):

memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &

memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
--ratio 1:0 --pipeline 8 -d 1000

Before: 106730.31 Ops/sec
After: 106360.11 Ops/sec

Test 6:
Memtier / memcached (16G brd SWAP, 8G memcg, THP always):

memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &

memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
--ratio 1:0 --pipeline 8 -d 1000

Before: 83193.11 Ops/sec
After: 82504.89 Ops/sec

These tests were run under heavy memory stress, and performance seems
basically the same as before, very slightly better or worse in certain
cases. The benefits of multi-index are basically erased by fragmentation,
and workingset node usage is slightly lower.

Some (maybe further) TODO items if we are OK with this approach:

- I see a slight performance regression for THP tests, but could not
identify a clear hotspot with perf. My guess is that contention on the
xa_lock is the issue (we have one xa_lock for every 64M of swap cache
space), since THP handling needs to hold the lock longer than usual.
Splitting the xa_lock to be more fine-grained seems a good solution. We
have SWAP_ADDRESS_SPACE_SHIFT = 14, which is not an optimal value.
Considering XA_CHUNK_SHIFT is 6, we will have three layers of XArray
just for 2 extra bits. 12 should be better, always making use of the
whole XA chunk and having two layers at most. But duplicated
address_space structs also waste more memory and cachelines.
I see an observable performance drop (~3%) after changing
SWAP_ADDRESS_SPACE_SHIFT to 12. It might be a good idea to decouple the
swap cache xarray from address_space (there are not too many users of
swapcache, so it shouldn't get too messy).

- Actually, after Patch 4/10 the performance is much better for tests
limited by memory cgroup, until 10/10 applies the direct swap cache
freeing logic for SWP_SYNCHRONOUS_IO swapin. If the swap device is not
nearly full, swapin doesn't clear the swapcache, so repeated swapout
doesn't need to re-allocate a swap entry, which makes things faster. This
may indicate that lazy freeing of swap cache could benefit certain
workloads and may be worth looking into later.

- Now SWP_SYNCHRONOUS_IO swapin will bypass readahead and force drop
swap cache after swapin is done, which can be cleaned up and optimized
further after this patch. Device type will only determine the
readahead logic, and swap cache drop check can be based purely on swap
count.

- Recent mTHP swapin/swapout series should have no fundamental
conflict with this.
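
The layer arithmetic in the first TODO item can be checked with a few lines
of Python. This is just a worked calculation, assuming 4K pages; the
function names are hypothetical. An address space covering 2^shift pages
needs ceil(shift / XA_CHUNK_SHIFT) XArray levels, and the top level only
uses the leftover index bits.

```python
XA_CHUNK_SHIFT = 6   # each XArray level resolves 6 index bits
PAGE_SHIFT = 12      # assumption: 4K pages


def xarray_levels(shift):
    """Levels needed to index 2**shift pages (ceiling division)."""
    return -(-shift // XA_CHUNK_SHIFT)


def top_level_bits(shift):
    """Index bits actually used by the top level (6 means a full chunk)."""
    return shift % XA_CHUNK_SHIFT or XA_CHUNK_SHIFT


def space_mib(shift):
    """Swap cache space covered by one address_space, in MiB."""
    return (1 << (shift + PAGE_SHIFT)) >> 20


for shift in (14, 12):
    print(shift, xarray_levels(shift), top_level_bits(shift), space_mib(shift))
```

For shift 14 this gives three levels with only 2 bits used at the top and
64M per address_space; for shift 12 it gives two full levels and 16M per
address_space, which is also why four times as many address_space structs
are needed for the same amount of swap.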

Kairui Song (10):
mm/filemap: split filemap storing logic into a standalone helper
mm/swap: move no readahead swapin code to a stand-alone helper
mm/swap: convert swapin_readahead to return a folio
mm/swap: remove cache bypass swapin
mm/swap: clean shadow only in unmap path
mm/swap: switch to use multi index entries
mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get
mm/swap: use swap cache as a synchronization layer
mm/swap: delay the swap cache look up for swapin
mm/swap: optimize synchronous swapin

include/linux/swapops.h | 5 +-
mm/filemap.c | 161 +++++++++-----
mm/huge_memory.c | 78 +++----
mm/internal.h | 2 +
mm/memory.c | 133 ++++-------
mm/shmem.c | 44 ++--
mm/swap.h | 71 ++++--
mm/swap_state.c | 478 +++++++++++++++++++++-------------------
mm/swapfile.c | 64 +++---
mm/vmscan.c | 8 +-
mm/workingset.c | 2 +-
mm/zswap.c | 4 +-
12 files changed, 540 insertions(+), 510 deletions(-)

--
2.43.0