Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

From: Yosry Ahmed
Date: Mon Nov 20 2023 - 22:37:47 EST


On Mon, Nov 20, 2023 at 7:35 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Yosry Ahmed <yosryahmed@xxxxxxxxxx> writes:
>
> > On Mon, Nov 20, 2023 at 5:55 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@xxxxxxxxxx> writes:
> >>
> >> > On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >> >>
> >> >> Yosry Ahmed <yosryahmed@xxxxxxxxxx> writes:
> >> >>
> >> >> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >> >> >>
> >> >> >> Chris Li <chriscli@xxxxxxxxxx> writes:
> >> >> >>
> >> >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >> >> >> >>
> >> >> >> >> Not bypassing the swap slot cache, just make the callbacks to
> >> >> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
> >> >> >> >> no longer used and is entering the swap slot cache (i.e. when
> >> >> >> >> free_swap_slot() is called), instead of when draining the swap slot
> >> >> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM
> >> >> >> >> outside of swap, the swap entry is freed when free_swap_slot() is
> >> >> >> >> called. We don't free it immediately because of caching, but this
> >> >> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
> >> >> >> >
> >> >> >> > That will cancel the batching effect on the swap slot free, making the
> >> >> >> > common case for swapping faults take longer to complete, righ?
> >> >> >> > If I recall correctly, the uncharge is the expensive part of the swap
> >> >> >> > slot free operation.
> >> >> >> > I just want to figure out what we are trading off against. This is not
> >> >> >> > one side wins all situations.
> >> >> >>
> >> >> >> Per my understanding, we don't batch memcg uncharging in
> >> >> >> swap_entry_free() now. Although it's possible and may improve
> >> >> >> performance.
> >> >> >
> >> >> > Yes. It actually causes a long tail in swapin fault latency as Chris
> >> >> > discovered in our prod. I am wondering if doing the memcg uncharging
> >> >> > outside the slots cache will actually amortize the cost instead.
> >> >> >
> >> >> > Regardless of memcg charging, which is more complicated, I think we
> >> >> > should at least move the call to zswap_invalidate() before the slots
> >> >> > cache. I would prefer that we move everything non-swapfile specific
> >> >> > outside the slots cache layer (zswap_invalidate(),
> >> >> > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(),
> >> >> > mem_cgroup_uncharge_swap(), ..). However, if some of those are
> >> >> > controversial, we can move some of them for now.
> >> >>
> >> >> That makes sense for me.
> >> >>
> >> >> > When draining free swap slots from the cache, swap_range_free() is
> >> >> > called with nr_entries == 1 anyway, so I can't see how any batching is
> >> >> > going on. If anything it should help amortize the cost.
> >> >>
> >> >> In swapcache_free_entries(), the sis->lock will be held to free multiple
> >> >> swap slots via swap_info_get_cont() if possible. This can reduce
> >> >> sis->lock contention.
> >> >
> >> > Ah yes that's a good point. Since most of these callbacks don't
> >> > actually access sis, but use the swap entry value itself, I am
> >> > guessing the reason we need to hold the lock for all these callbacks
> >> > is to prevent swapoff and swapon reusing the same swap entry on a
> >> > different swap device, right?
> >>
> >> In,
> >>
> >> swapcache_free_entries()
> >> swap_entry_free()
> >> swap_range_free()
> >>
> >> Quite some sis fields will be accessed.
> >
> > I wasn't referring to this code. I was what's preventing us from
> > moving the callbacks I mentioned outside the lock (zswap_invalidate(),
> > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(),
> > mem_cgroup_uncharge_swap(), ..). I think most or all of them don't
> > really access sis, but perhaps they need the lock to ensure the swap
> > entry value does not get reused?
>
> In fact, the swap entries to be freed by swapcache_free_entries() is in
> a state that can not be freed by other path (including swapoff()). It's
> swap_map value is SWAP_HAS_CACHE, but we can not find folio in
> swap_address_space().

Interesting, it would be even nicer if we can move them outside the lock.

>
> To be honest, I don't know whether there are dependencies on sis->lock
> in these callbacks. You need to investigate them one by one.

Yeah moving these callbacks outside batching and the lock is very
intriguing but needs to be done carefully. We don't need to do it all
at once, we can start with zswap_invalidate() and move them as we see
fit. It would be nice if the code is refactored such that it's clear
what callbacks are made immediately when the entry is no longer used
and what callbacks are made when the swap slot is being freed.

>
> --
> Best Regards,
> Huang, Ying