Re: [PATCH v4] zswap: replace RB tree with xarray

From: Yosry Ahmed
Date: Thu Mar 07 2024 - 04:07:15 EST


[..]
> > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > > -{
> > > - rb_erase(&entry->rbnode, root);
> > > - RB_CLEAR_NODE(&entry->rbnode);
> > > + e = xa_store(tree, offset, entry, GFP_KERNEL);
> > > + err = xa_err(e);
> > > +
> > > + if (err) {
> > > + e = xa_erase(tree, offset);
> > > + if (err == -ENOMEM)
> > > + zswap_reject_alloc_fail++;
> > > + else
> > > + zswap_reject_xarray_fail++;
> >
> > I think this is too complicated, and as Chengming pointed out, I believe
> > we can use xa_store() directly in zswap_store().
>
> Sure.
>
> > I am also not sure what the need for zswap_reject_xarray_fail is. Are
> > there any reasons why the store here can fail other than -ENOMEM? The
> > docs say the only other option is -EINVAL, and looking at __xa_store(),
> > it seems like this is only possible if xa_is_internal() is true (which
> > means we are not passing in a properly aligned pointer IIUC).
>
> Because the xa_store document said it can return two error codes. I
> see zswap try to classify the error count it hit, that is why I add
> the zswap_reject_xarray_fail.

Right, but I think we should not get -EINVAL in this case. I think it
would be more appropriate to have WARN_ON() or VM_WARN_ON() in this
case?
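
To make the suggestion concrete, here is a minimal userspace sketch, not
kernel code. handle_store_error(), reject_alloc_fail and unexpected_errors
are made-up names for illustration only; fprintf() stands in for where
WARN_ON() or VM_WARN_ON() would go. It assumes -ENOMEM is the only errno we
expect from the store and treats anything else as a bug to warn about rather
than a separate reject counter:

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; none of these are kernel symbols. */
static unsigned long reject_alloc_fail;	/* models zswap_reject_alloc_fail */
static unsigned long unexpected_errors;	/* counts what WARN_ON() would flag */

/*
 * Classify the errno of a store into the tree.
 * Returns 0 if the store succeeded, 1 if the page must be rejected.
 */
static int handle_store_error(int err)
{
	if (!err)
		return 0;

	if (err == -ENOMEM) {
		/* The expected failure: account it as an allocation failure. */
		reject_alloc_fail++;
	} else {
		/* Should never happen; the kernel version would warn here. */
		fprintf(stderr, "unexpected store error: %d\n", err);
		unexpected_errors++;
	}
	return 1;
}
```

With this shape there is no need for a dedicated xarray failure counter; an
unexpected errno shows up as a warning instead.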

[..]
> > > @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
> > > static int zswap_writeback_entry(struct zswap_entry *entry,
> > > swp_entry_t swpentry)
> > > {
> > > - struct zswap_tree *tree;
> > > + struct xarray *tree;
> > > + pgoff_t offset = swp_offset(swpentry);
> > > + struct zswap_entry *e;
> > > struct folio *folio;
> > > struct mempolicy *mpol;
> > > bool folio_was_allocated;
> > > @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > > * be dereferenced.
> > > */
> > > tree = swap_zswap_tree(swpentry);
> > > - spin_lock(&tree->lock);
> > > - if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> > > - spin_unlock(&tree->lock);
> > > + e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> > > + if (e != entry) {
> >
> > I think we can avoid adding 'e' and 'offset' local variables here and
> > just do everything in the if condition. If you want to avoid the line
> > break, then introducing 'offset' is fine, but I don't see any value from
> > 'e'.
>
> As I said in my other email. I don't think having this type of local
> variable affects the compiler negatively. The compiler generally uses
> their own local variable to track the expression anyway. So I am not
> sure about the motivation to remove local variables alone, if it helps
> the reading. I feel the line "if (xa_cmpxchg(tree, offset, entry,
> NULL, GFP_KERNEL) != entry)" is too long and complicated inside the if
> condition. That is just me. Not a big deal.

I just think 'e' is not providing any readability improvements. If
anything, people need to pay closer attention to figure out 'e' is only
a temp variable and 'entry' is the real deal.

I vote for:
if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL))
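
For illustration, a small userspace model of that pattern using C11 atomics.
This is only a sketch: toy_entry, toy_slot, toy_cmpxchg() and
toy_claim_for_writeback() are hypothetical names, and a single atomic pointer
stands in for one index of the tree. Like xa_cmpxchg(), the helper returns
the value that was in the slot before the call:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct zswap_entry. */
struct toy_entry {
	int length;
};

/* One atomic slot modelling a single index in the tree. */
static _Atomic(struct toy_entry *) toy_slot;

/*
 * Store 'repl' only if the slot currently holds 'old'.
 * Returns whatever the slot held before the call.
 */
static struct toy_entry *toy_cmpxchg(struct toy_entry *old,
				     struct toy_entry *repl)
{
	struct toy_entry *expected = old;

	/* On failure, 'expected' is overwritten with the current value. */
	atomic_compare_exchange_strong(&toy_slot, &expected, repl);
	return expected;
}

/* Returns true if we removed 'entry' ourselves, false if it was already gone. */
static bool toy_claim_for_writeback(struct toy_entry *entry)
{
	/* No temporary needed: compare the old value against 'entry' directly. */
	if (entry != toy_cmpxchg(entry, NULL))
		return false;

	return true;
}
```

The comparison reads the same way as the line above: if the old value is not
'entry', someone else got there first and we bail out.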

[..]
> > > @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio)
> > > {
> > > swp_entry_t swp = folio->swap;
> > > pgoff_t offset = swp_offset(swp);
> > > - struct zswap_tree *tree = swap_zswap_tree(swp);
> > > - struct zswap_entry *entry, *dupentry;
> > > + struct xarray *tree = swap_zswap_tree(swp);
> > > + struct zswap_entry *entry, *old;
> > > struct obj_cgroup *objcg = NULL;
> > > struct mem_cgroup *memcg = NULL;
> > > + int err;
> > > + bool old_erased = false;
> > >
> > > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio)
> > > kunmap_local(src);
> > > entry->length = 0;
> > > entry->value = value;
> > > + entry->pool = NULL;
> >
> > Why do we need to initialize the pool here? Is this a bug fix for an
> > existing problem or just keeping things clean? Either way I think it
> > should be done separately, unless it is related to a change in this
> > patch.
>
> I noticed that entry->pool is left uninitialized. I think it should
> be cleaned up. It is a cleanup, so it does not need to happen in this
> patch. I can do that as a separate patch.

Yes please.

[..]
> >
> > > /*
> > > * The folio may have been dirtied again, invalidate the
> > > * possibly stale entry before inserting the new entry.
> > > */
> > > - if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > > - zswap_invalidate_entry(tree, dupentry);
> > > - WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > > + err = zswap_xa_insert(tree, entry, &old);
> > > + if (old)
> > > + zswap_entry_free(old);
> > > + if (err) {
> > > + old_erased = true;
> >
> > I think this can be made simpler if we open code xa_store() here,
> > especially that we already have cleanup code below under 'check_old'
> > that removes the existing old entry. So zswap_xa_insert() replicates
> > this cleanup, then we add this 'old_erased' boolean to avoid doing the
> > cleanup below. It seems like it would be much more straightforward with
> > open-coding xa_store() here and relying on the existing cleanup for the
> > old entry. Also, if we initialize 'old' to NULL, we can use its value
> > to figure out whether any cleanup is needed under 'check_old' or not.
>
> I think that is very similar to what Chengming was suggesting.
>
> >
> > Taking a step back, I think we can further simplify this. What if we
> > move the tree insertion to right after we allocate the zswap entry? In
> > this case, if the tree insertion fails, we don't need to decrement the
> > same filled counter. If the tree insertion succeeds and then something
> > else fails, the existing cleanup code under 'check_old' will already
> > clean up the tree insertion for us.
>
> That will create complications that, if the zswap compression fails
> the compression ratio, you will have to remove the entry from the tree
> as clean up. You have both xa_store() and xa_erase() where the current
> code just does one xa_erase() on compression failure.

Not really. If xa_store() fails because of -ENOMEM, then I think by
definition we do not need xa_erase() as there shouldn't be any stale
entries. I also think -ENOMEM should be the only valid errno from
xa_store() in this context. So we can avoid the check_old code if
xa_store() is called (whether it fails or succeeds) IIUC.

I prefer calling xa_store() early and avoiding the extra 'insert_failed'
cleanup code, especially since, unlike other cleanup code, it has its own
branching based on entry->length. I am also planning a cleanup for
zswap_store() to split the code better for the same_filled case and
avoid some unnecessary checks and failures, so it would be useful to
keep the common code path together.

>
> >
> > If this works, we don't need to add extra cleanup code or move any code
> > around. Something like:
>
> Due to the extra xa_insert() on compression failure, I think
> Chengming's or your earlier suggestion is better.
>
> BTW, while you are here, can you confirm this race discussed in
> earlier email can't happen? Chengming convinced me this shouldn't
> happen. Like to hear your thoughts.
>
> CPU1                          CPU2
>
> xa_store()
>                               entry = xa_erase()
>                               zswap_free_entry(entry)
>
> if (entry->length)
>    ...
> CPU1 is using entry after free.

IIUC, CPU1 is in zswap_store(), and CPU2 could be in either
zswap_invalidate() or zswap_load().

For zswap_load(), I think synchronization is done in the core swap code to
ensure we are not doing parallel swapin/swapout on the same entry,
right? In this specific case, I think the folio would be in the
swapcache while swapout (i.e. zswap_store()) is ongoing, so any swapins
will read the folio and not call zswap_load().

Actually, if we do not prevent parallel swapin/swapout on the same entry,
I suspect we may have problems even outside of zswap. For example, we
may read a partially written swap entry from disk, right? Or does the
block layer synchronize this somehow?

For zswap_invalidate(), the core swap code calls it when the swap entry
is no longer used and before we free it for reuse, so IIUC parallel
swapouts (i.e. zswap_store()) should not be possible here as well.