Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()

From: Yosry Ahmed
Date: Tue Jan 23 2024 - 16:05:50 EST


On Tue, Jan 23, 2024 at 12:30 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Tue, Jan 23, 2024 at 7:55 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@cmpxchgorg> wrote:
> > >
> > > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote:
> > > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > > >
> > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote:
> > > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is
> > > > > > called for all swap entries before zswap_swapoff() is called. This means
> > > > > > that all zswap entries should already be removed from the tree. Simplify
> > > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an
> > > > > > assertion in its place.
> > > > > >
> > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > > >
> > > > > Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > > > >
> > > > > That's a great simplification.
> > > > >
> > > > > Removing the tree->lock made me double take, but at this point the
> > > > > swapfile and its cache should be fully dead and I don't see how any of
> > > > > the zswap operations that take tree->lock could race at this point.
> > > >
> > > > It took me a while staring at the code to realize this loop is pointless.
> > > >
> > > > However, while I have your attention on the swapoff path, there's a
> > > > slightly irrelevant problem that I think might be there, but I am not
> > > > sure.
> > > >
> > > > It looks to me like swapoff can race with writeback, and there may be
> > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff()
> > > > races with shrink_memcg_cb(), I feel like we may free the tree as it
> > > > is being used. For example if zswap_swapoff()->kfree(tree) happen
> > > > right before shrink_memcg_cb()->list_lru_isolate(l, item).
> > > >
> > > > Please tell me that I am being paranoid and that there is some
> > > > protection against zswap writeback racing with swapoff. It feels like
> > > > we are very careful with zswap entries refcounting, but not with the
> > > > zswap tree itself.
> > >
> > > Hm, I don't see how.
> > >
> > > Writeback operates on entries from the LRU. By the time
> > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should
> > > will have emptied out the LRU and tree.
> > >
> > > Writeback could have gotten a refcount to the entry and dropped the
> > > tree->lock. But then it does __read_swap_cache_async(), and while
> > > holding the page lock checks the tree under lock once more; if that
> > > finds the entry valid, it means try_to_unuse() hasn't started on this
> > > page yet, and would be held up by the page lock/writeback state.
> >
> > Consider the following race:
> >
> > CPU 1 CPU 2
> > # In shrink_memcg_cb() # In swap_off
> > list_lru_isolate()
> > zswap_invalidate()
> > ..
> > zswap_swapoff() -> kfree(tree)
> > spin_lock(&tree->lock);
> >
> > Isn't this a UAF or am I missing something here?
>
> I need to read this code closer. But this smells like a race to me as well.
>
> Long term speaking, I think decoupling swap and zswap will fix this,
> no? We won't need to kfree(tree) inside swapoff. IOW, if we have a
> single zswap tree that is not tied down to any swapfile, then we can't
> have this race. There might be other races introduced by the
> decoupling that I might have not foreseen tho :)

Ideally yes, it should make the zswap <-> swapfile interaction more
straightforward. Ideally in that world, writeback is handled in the
swap core and is just moving an entry from one backend to another, so
synchronization with things like swapoff should be easier. Also, I
think in that world we don't need a zswap tree to begin with, there's
a global tree(s) that maps a swap index directly to a swapfile swap
entry or a zswap entry.

>
> Short term, no clue hmm. Let me think a bit more about this.

See my response to Johannes, I think we may be able to fix this with
refcounting + RCU.