Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()

From: Yosry Ahmed
Date: Thu Jan 25 2024 - 02:54:55 EST


> Hello,
>
> I also thought about this problem for some time, maybe something like below
> can be changed to fix it? It's likely I missed something, just some thoughts.
>
> IMHO, the problem is caused by the different way in which we use zswap entry
> in the writeback, that should be much like zswap_load().
>
> The zswap_load() comes in with the folio locked in swap cache, so it has
> stable zswap tree to search and lock... But in writeback case, we don't,
> shrink_memcg_cb() comes in with only a zswap entry with lru list lock held,
> then release lru lock to get tree lock, which maybe freed already.
>
> So we should change here, we read swpentry from entry with lru list lock held,
> then release lru lock, to try to lock corresponding folio in swap cache,
> if we success, the following things is much the same like zswap_load().
> We can get tree lock, to recheck the invalidate race, if no race happened,
> we can make sure the entry is still right and get refcount of it, then
> release the tree lock.

Hmm I think you may be onto something here. Moving the swap cache
allocation ahead before referencing the tree should give us the same
guarantees as zswap_load() indeed. We can also consolidate the
invalidate race checks (right now we have one in shrink_memcg_cb() and
another one inside zswap_writeback_entry()).

We will have to be careful about the error handling path to make sure
we delete the folio from the swap cache only after we know the tree
won't be referenced anymore. Anyway, I think this can work.

On a separate note, I think there is a bug in zswap_writeback_entry()
when we delete a folio from the swap cache. I think we are missing a
folio_unlock() there.

>
> The main differences between this writeback with zswap_load() is the handling
> of lru entry and the tree lifetime. The whole zswap_load() function has the
> stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half
> after __swap_writepage() since we unlock the folio after that. So we can't
> reference the tree after that.
>
> This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early
> in tree lock, since thereafter writeback can't fail. BTW, I think we should
> also zswap_invalidate_entry() early in zswap_load() and only support the
> zswap_exclusive_loads_enabled mode, but that's another topic.

zswap_invalidate_entry() actually doesn't seem to be using the tree at all.

>
> The second difference is the handling of lru entry, which is easy that we
> just zswap_lru_del() in tree lock.

Why do we need zswap_lru_del() at all? We should have already isolated
the entry at that point IIUC.