Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

From: Nhat Pham
Date: Mon Nov 13 2023 - 10:12:16 EST


On Mon, Nov 13, 2023 at 8:06 AM Zhongkun He
<hezhongkun.hzk@xxxxxxxxxxxxx> wrote:
>
> I recently found two scenarios where zswap entry could not be
> released, which will cause shrink_worker and active recycling
> to fail.
> 1)The swap entry has been freed, but cached in swap_slots_cache,
> no swap cache and swapcount=0.
> 2)When the option zswap_exclusive_loads_enabled disabled and
> zswap_load completed(page in swap_cache and swapcount = 0).
>
> The above two cases need to be determined by swapcount=0,
> fix it.
>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx>
> ---
> mm/zswap.c | 35 +++++++++++++++++++++++++----------
> 1 file changed, 25 insertions(+), 10 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 74411dfdad92..db95491bcdd5 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1063,11 +1063,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> struct mempolicy *mpol;
> struct scatterlist input, output;
> struct crypto_acomp_ctx *acomp_ctx;
> + struct swap_info_struct *si;
> struct zpool *pool = zswap_find_zpool(entry);
> bool page_was_allocated;
> u8 *src, *tmp = NULL;
> unsigned int dlen;
> - int ret;
> + int ret = 0;
> struct writeback_control wbc = {
> .sync_mode = WB_SYNC_NONE,
> };
> @@ -1082,16 +1083,30 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> mpol = get_task_policy(current);
> page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> NO_INTERLEAVE_INDEX, &page_was_allocated);
> - if (!page) {
> + if (!page)
> ret = -ENOMEM;
> - goto fail;
> - }
> -
> - /* Found an existing page, we raced with load/swapin */
> - if (!page_was_allocated) {
> + else if (!page_was_allocated) {
> + /* Found an existing page, we raced with load/swapin */
> put_page(page);
> ret = -EEXIST;
> - goto fail;
> + }
> +
> + if (ret) {
> + si = get_swap_device(swpentry);
> + if (!si)
> + goto out;
> +
> + /* Two cases to directly release zswap_entry.
> + * 1) -ENOMEM,if the swpentry has been freed, but cached in
> + * swap_slots_cache(no page and swapcount = 0).
> + * 2) -EEXIST, option zswap_exclusive_loads_enabled disabled and
> + * zswap_load completed(page in swap_cache and swapcount = 0).
> + */

These two cases should not count as "successful writeback", right?

I'm slightly biased of course, since my zswap shrinker depends on this
as one of the potential signals for over-shrinking. That aside, I think
this constitutes a failed writeback (i.e. it should not increment the
writeback counter, and the limit-based reclaim should try again, etc.).
Counting it as a success would be incredibly confusing for users.

For instance, we were trying to estimate the number of zswap store
failures by subtracting the writeback count from the overall pswpout.
This change could throw us off by inflating the writeback count, and
deflating the estimated zswap store failure count as a result.

Regarding the second case specifically, I thought that was the point of
having zswap_exclusive_loads_enabled disabled, i.e. we still keep a copy
around in the zswap pool even after a completed zswap_load? Based
on the Kconfig documentation:

"This avoids having two copies of the same page in memory
(compressed and uncompressed) after faulting in a page from zswap.
The cost is that if the page was never dirtied and needs to be
swapped out again, it will be re-compressed."

> + if (!swap_swapcount(si, swpentry))
> + ret = 0;
> +
> + put_swap_device(si);
> + goto out;
> }
>
> /*
> @@ -1106,7 +1121,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> spin_unlock(&tree->lock);
> delete_from_swap_cache(page_folio(page));
> ret = -ENOMEM;
> - goto fail;
> + goto out;
> }
> spin_unlock(&tree->lock);
>
> @@ -1151,7 +1166,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>
> return ret;
>
> -fail:
> +out:
> if (!zpool_can_sleep_mapped(pool))
> kfree(tmp);
>
> --
> 2.25.1
>