Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()

From: Chris Li
Date: Thu Jan 25 2024 - 00:44:25 EST


On Tue, Jan 23, 2024 at 11:21 PM Chengming Zhou
<zhouchengming@xxxxxxxxxxxxx> wrote:
> > Thanks for the great analysis, I missed the swapoff/swapon race myself :)
> >
> > The first solution that came to mind for me was refcounting the zswap
> > tree with RCU with percpu-refcount, similar to how cgroup refs are
> > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> > the percpu-refcount may be an overkill in terms of memory usage
> > though. I think we can still do our own refcounting with RCU, but it
> > may be more complicated.
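
(Just to make the percpu-refcount idea quoted above concrete: a rough,
hypothetical sketch, assuming a per-swapfile struct zswap_tree with an
embedded percpu_ref; the field and helper names here are made up for
illustration, not actual zswap code.)

#include <linux/percpu-refcount.h>
#include <linux/completion.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/gfp.h>

struct zswap_tree {
        struct rb_root rbroot;
        spinlock_t lock;
        struct percpu_ref ref;          /* pins the tree against swapoff */
        struct completion ref_done;     /* swapoff waits for the last ref */
};

static void zswap_tree_release(struct percpu_ref *ref)
{
        struct zswap_tree *tree = container_of(ref, struct zswap_tree, ref);

        complete(&tree->ref_done);
}

/* zswap_swapon() side: the initial reference keeps the tree alive */
static int zswap_tree_init_ref(struct zswap_tree *tree)
{
        init_completion(&tree->ref_done);
        return percpu_ref_init(&tree->ref, zswap_tree_release, 0, GFP_KERNEL);
}

/* writeback side: pin the tree before touching it, put when done */
static bool zswap_tree_tryget(struct zswap_tree *tree)
{
        return percpu_ref_tryget(&tree->ref);
}

/* zswap_swapoff() side: kill the initial ref, wait for in-flight users */
static void zswap_tree_wait_and_exit(struct zswap_tree *tree)
{
        percpu_ref_kill(&tree->ref);
        wait_for_completion(&tree->ref_done);
        percpu_ref_exit(&tree->ref);
}

Whether the per-cpu counters are worth the extra memory is exactly the
concern raised above.
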
> Hello,
>
> I also thought about this problem for some time; maybe something like the
> below can fix it? It's likely I missed something, these are just some thoughts.
>
> IMHO, the problem is caused by the different way in which we use the zswap
> entry in writeback, which should be much like zswap_load().

It is possible to keep the behavior difference and still get the tree
lifetime right, by following a few rules:
1) Always bump the zswap entry refcount after finding the entry
(inside the RCU read lock).
2) Only free the entry after its refcount drops to zero.
3) Swapoff should just wait for each entry's refcount to drop to zero
(or to 1, counting the reference the swapoff code itself holds), then
free it. Swapoff is the slow path anyway.
BTW, swapoff is where the swap device locks get a lot of contention.
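
A rough sketch of rules 2) and 3), assuming a refcount_t inside struct
zswap_entry; the helper names are illustrative only, not actual zswap code:

#include <linux/refcount.h>
#include <linux/sched.h>

/* 2) whoever drops the last reference frees the entry */
static void zswap_entry_put(struct zswap_entry *entry)
{
        if (refcount_dec_and_test(&entry->refcount))
                zswap_free_entry(entry);
}

/*
 * 3) swapoff is the slow path: after unlinking an entry from the tree it
 *    can simply wait until it holds the only remaining reference, then
 *    drop that last reference, which frees the entry via the rule above.
 */
static void zswap_swapoff_wait_and_free(struct zswap_entry *entry)
{
        while (refcount_read(&entry->refcount) > 1)
                schedule_timeout_uninterruptible(1);
        zswap_entry_put(entry);
}

The key point is that nobody frees the entry directly; the last
zswap_entry_put() does it. Rule 1) is the lookup side, see below.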

>
> The zswap_load() comes in with the folio locked in the swap cache, so it
> has a stable zswap tree to search and lock... But in the writeback case we
> don't: shrink_memcg_cb() comes in with only a zswap entry, with the lru
> list lock held, then releases the lru lock to take the tree lock, and the
> tree may have been freed already.
>
> So we should change this: read the swpentry from the entry with the lru
> list lock held, then release the lru lock and try to lock the corresponding
> folio in the swap cache. If we succeed, what follows is much the same as
> zswap_load(). We can take the tree lock to recheck for the invalidate race;
> if no race happened, we can be sure the entry is still valid, take a
> refcount on it, and then release the tree lock.
>
> The main differences between this writeback path and zswap_load() are the
> handling of the lru entry and the tree lifetime. The whole zswap_load()
> function has a stable reference to the zswap tree, but that's not true for
> the second half of shrink_memcg_cb() after __swap_writepage(), since we
> unlock the folio there. So we can't reference the tree after that point.
>
> This problem is easy to fix: we can zswap_invalidate_entry(tree, entry)
> early, under the tree lock, since the writeback can't fail after that.
> BTW, I think we should also zswap_invalidate_entry() early in zswap_load()
> and only support the zswap_exclusive_loads_enabled mode, but that's
> another topic.
>
> The second difference is the handling of the lru entry, which is easy: we
> just zswap_lru_del() under the tree lock.
>
> So no path can reference the entry from the tree or the lru after we
> release the tree lock, and we can just zswap_free_entry() after writeback.
>
> Thanks!
>
> // lru list lock held
> shrink_memcg_cb()
>         swpentry = entry->swpentry
>         // Don't isolate entry from lru list here, just use list_lru_putback()
>         spin_unlock(lru list lock)
>
>         folio = __read_swap_cache_async(swpentry)
>         if (!folio)
>                 return
>
>         if (!folio_was_allocated)
>                 folio_put(folio)
>                 return
>
>         // folio is locked, swapcache is secured against swapoff
>         tree = get tree from swpentry
>         spin_lock(&tree->lock)

That will not work well with the zswap-to-xarray change. We want to remove
the tree lock and only use the xarray lock.
The lookup should just hold the xarray RCU read lock and return the entry
with its refcount increased.
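
Something like the following, as a sketch only: xa_load() under
rcu_read_lock(), then a tryget on the entry before returning. The helper
name and the refcount field are illustrative, not actual zswap code.

#include <linux/xarray.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>

static struct zswap_entry *zswap_xa_find_get(struct xarray *xa, pgoff_t offset)
{
        struct zswap_entry *entry;

        rcu_read_lock();
        entry = xa_load(xa, offset);
        if (entry && !refcount_inc_not_zero(&entry->refcount))
                entry = NULL;   /* concurrently freed, treat as a miss */
        rcu_read_unlock();

        return entry;
}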

Chris

>
>         // check invalidate race? No
>         if (entry == zswap_rb_search())
>                 // so we can make sure this entry is still right
>                 // zswap_invalidate_entry() since the below writeback can't fail
>                 zswap_entry_get(entry)
>                 zswap_invalidate_entry(tree, entry)
>
>                 // remove from lru list
>                 zswap_lru_del()
>
>         spin_unlock(&tree->lock)
>
>         __zswap_load()
>
>         __swap_writepage() // folio unlock
>         folio_put(folio)
>
>         // entry is safe to free, since it's removed from tree and lru above
>         zswap_free_entry(entry)
>