[PATCH RFC 0/1] mm/zswap: fix LRU reclaim for zswap writeback folios

From: chengming . zhou
Date: Fri Feb 09 2024 - 07:00:57 EST


From: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>

The memory shrinker of zswap is special: it has to first allocate more
memory (folio) to free less memory (compressed copy), later that folio
can be freed after written back. So it's better to evict these folios
before trying to reclaim other folios.

Here may come some problems of LRU management:

1. zswap_writeback_entry() reuses the __read_swap_cache_async(), which
is normally only used in the swapin context. This may cause refault
accounting and active/workingset information of the folio to be wrong.

For example, zswap shrinker writeback try to reclaim the folio, but
workingset_refault() mark it active to put it at head of active list.

2. folio_end_writeback() will check PG_reclaim flag which we did set
in zswap_writeback_entry(), to try to rotate the folio to the tail
of inactive list, to speed up its reclaim.

But folio_rotate_reclaimable() won't work as we thought, actually
all LRU move interfaces may don't work when the folio is isolated
from LRU. (per-cpu add batch is somewhat like isolated from LRU)

So when folio_end_writeback() calls folio_rotate_reclaimable(),
it won't do nothing but just clear PG_reclaim flag if that folio
is isolated (in per-cpu add batch or isolated by vmscan shrinker)

3. so the final result is the folio that has been written back and is
expected to be evicted, but now is not at tail of inactive list.

Meanwhile vmscan shrinker may try to evict other folios to cause
more refaults. There is a report [1] of this problem.

We should handle these cases better. First of all, we should consider
when and where to put these folios on LRU.

1. after alloc: now we use folio_add_lru() to put folio on local batch,
so it will be put at head of inactive/active list when batch drain.

2. after writeback: clear PG_reclaim and folio_rotate_reclaimable().
- after add batch drain: rotate successfully to tail of inactive list
- before add batch drain: do nothing since folio is not on LRU list

So these are two main time points we care about: the first is somewhat
correct IMHO since the folio is under writeback and has PG_reclaim set,
it may confuse shrinker if we put those to the tail of inactive list
too early. If we really want to put it to tail, we can easily introduce
another folio_add_lru_tail() to put on a new local lru_add_tail batch.

The second is where we need to improve, we should rotate it to tail
even when folio is in local batch. Since we can't manipulate folio LRU
status when it's isolated in local batch, an obvious fix is to use
folio flag to tell later lru_add_fn() where the folio should be put:
active or inactive, head or tail.

But the problem is that PG_readahead is the alias of PG_reclaim, we
can't put readahead folio to the tail of inactive list obviously.

So this patch changes folio_rotate_reclaimable() to queue to local
rotate batch even when !PG_lru at first, hoping that:

- folio_end_writeback() finish on the same cpu with lru_add batch,
so it must be handled after the lru_add batch, after which it will
see PG_lru and successfully rotate it to tail of inactive list.

- even folio_end_writeback() is not finished on the same cpu, there
maybe a big chance that rotate batch is handled after add batch.

Testing kernel build in tmpfs with memory.max = 2GB.
(zswap shrinker and writeback enabled with one 50GB swapfile)

mm-unstable-hot zswap-lru-reclaim
real 63.34 62.72
user 1063.20 1060.30
sys 272.04 256.14
workingset_refault_anon 2103297.00 1788155.80
workingset_refault_file 28638.20 39249.40
workingset_activate_anon 746134.00 695435.40
workingset_activate_file 4344.60 4255.80
workingset_restore_anon 653163.80 605315.60
workingset_restore_file 1079.00 883.00
workingset_nodereclaim 0.00 0.00
pgscan 12971305.60 12730331.20
pgscan_kswapd 0.00 0.00
pgscan_direct 12971305.60 12730331.20
pgscan_khugepaged 0.00 0.00

We can see that refault and sys cpu have some improvements.

As for the workingset_refault() caused by zswap writeback, maybe we
should remove it in zswap writeback case, but there are more pgscan
and some regression. I don't know why, so just leave it as it is.

This is RFC, any comment or discussion is welcome! Thanks!

[1] https://lore.kernel.org/all/20231024142706.195517-1-hezhongkun.hzk@xxxxxxxxxxxxx/

Chengming Zhou (1):
mm/swap: queue reclaimable folio to local rotate batch when
!folio_test_lru()

mm/swap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

--
2.40.1