Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()

From: Yosry Ahmed
Date: Thu Jan 25 2024 - 03:00:13 EST


On Wed, Jan 24, 2024 at 9:29 PM Chris Li <chriscli@xxxxxxxxxx> wrote:
>
> Hi Yosry,
>
> On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
>
> > >
> > > Thanks for the great analysis, I missed the swapoff/swapon race myself :)
> > >
> > > The first solution that came to mind for me was refcounting the zswap
> > > tree with RCU with percpu-refcount, similar to how cgroup refs are
> > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> > > the percpu-refcount may be an overkill in terms of memory usage
> > > though. I think we can still do our own refcounting with RCU, but it
> > > may be more complicated.
> >
> > FWIW, I was able to reproduce the problem in a vm with the following
> > kernel diff:
>
> Thanks for the great find.
>
> I was worried about the use-after-free situation in this email:
>
> https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@xxxxxxxxxxxxxx/
>
> Glad you were able to find a reproducible case. That is one of the
> reasons I changed the free to invalidate entries in my xarray patch.
>
> I think the swapoff code should remove the entry from the tree, then
> just wait for each zswap entry's refcount to drop to zero and free it.

This doesn't really help. The swapoff code already removes all the
entries from the trees, through zswap_invalidate(), before
zswap_swapoff() is called. The race I described occurs because the
writeback code reaches the entries through the LRU, not the tree. The
writeback code could have isolated a zswap entry from the LRU before
swapoff, then tried to access it after swapoff. The zswap entry itself
is referenced and safe to use; the problem is accessing the tree to
grab the tree lock and check whether the entry is still in the tree.

>
> That way you shouldn't need to refcount the tree. The tree refcount is
> effectively the combined refcount of all the zswap entries.

The problem is that given a zswap entry, you have no way to stabilize
the zswap tree before trying to dereference it with the current code.
Chengming's suggestion of moving the swap cache pin before accessing
the tree seems like the right way to go.
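
To make the ordering concrete, here is a rough userspace model of the
two orderings. It is only a sketch: every name in it (model_pin(),
model_swapoff(), etc.) is made up for illustration and is not a kernel
API, the pin is a plain counter standing in for the swap cache pin, and
"freed" is just a flag so the bad case can be observed safely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; none of these exist in the kernel. */
struct model_tree {
	bool live;			/* false once swapoff tore the tree down */
};

struct model_swap_dev {
	bool online;			/* cleared by swapoff */
	int pins;			/* stands in for the swap cache pin */
	struct model_tree *tree;
};

struct model_entry {
	int refcount;			/* the entry itself stays referenced */
	struct model_swap_dev *dev;
};

enum model_wb_result { MODEL_WB_OK, MODEL_WB_SKIPPED, MODEL_WB_STALE_TREE };

/* Take a pin; fails once swapoff has completed. */
static bool model_pin(struct model_swap_dev *dev)
{
	if (!dev->online)
		return false;
	dev->pins++;
	return true;
}

static void model_unpin(struct model_swap_dev *dev)
{
	dev->pins--;
}

/* In this model swapoff refuses to proceed while pins are held. */
static bool model_swapoff(struct model_swap_dev *dev)
{
	if (dev->pins > 0)
		return false;
	dev->online = false;
	dev->tree->live = false;	/* models freeing the tree */
	return true;
}

/* Problematic order: touch the tree holding only an entry reference. */
static enum model_wb_result model_writeback_unpinned(struct model_entry *e)
{
	if (!e->dev->tree->live)
		return MODEL_WB_STALE_TREE;	/* would be a use-after-free */
	return MODEL_WB_OK;
}

/* Suggested order: pin first, and only then look at the tree. */
static enum model_wb_result model_writeback_pinned(struct model_entry *e)
{
	if (!model_pin(e->dev))
		return MODEL_WB_SKIPPED;
	assert(e->dev->tree->live);	/* the pin keeps the tree alive */
	model_unpin(e->dev);
	return MODEL_WB_OK;
}
```

In this model, an entry isolated before swapoff sees a stale tree with
the unpinned order, while the pinned order just skips the entry.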

> Having refcount on the tree would be very high contention.

A percpu refcount cannot be contended by definition :)
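
For illustration, a toy userspace model of why that is: each CPU only
touches its own counter slot, so gets and puts never share a cacheline,
and the slots are only summed when the ref is being torn down. This is
a sketch with made-up names (model_ref_get(), MODEL_NR_CPUS, ...), not
the kernel's percpu_ref implementation, and the per-slot storage is
also where the memory overhead mentioned earlier comes from.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4			/* hypothetical CPU count */

/* One counter slot per CPU; individual slots may go negative. */
struct model_percpu_ref {
	long count[MODEL_NR_CPUS];
};

/* A get only touches the calling CPU's slot. */
static void model_ref_get(struct model_percpu_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	ref->count[cpu]++;
}

/* A put may run on a different CPU than the matching get. */
static void model_ref_put(struct model_percpu_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	ref->count[cpu]--;
}

/* Only the sum over all slots is meaningful (computed at teardown). */
static long model_ref_sum(const struct model_percpu_ref *ref)
{
	long sum = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += ref->count[cpu];
	return sum;
}
```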