Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff

From: Nhat Pham
Date: Fri Jan 26 2024 - 14:50:28 EST


On Fri, Jan 26, 2024 at 12:32 AM <chengming.zhou@xxxxxxxxx> wrote:
>
> From: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
>
> LRU writeback has a race problem with swapoff, as spotted by Yosry [1]:
>
> CPU1                           CPU2
> shrink_memcg_cb                swap_off
>   list_lru_isolate             zswap_invalidate
>                                zswap_swapoff
>                                  kfree(tree)
>   // UAF
>   spin_lock(&tree->lock)
>
> The problem is that the entry in the lru list can't protect the tree from
> being freed by swapoff, and the entry itself can also be invalidated and
> freed concurrently after we unlock the lru lock.
>
> We can fix it by moving the swap cache allocation ahead of
> referencing the tree, then checking for the invalidate race under the
> tree lock; only after that can we safely deref the entry. Note that we
> can't deref the entry or the tree anymore once we unlock the folio,
> since we depend on the locked folio to hold off swapoff.
>
> So this patch moves all tree and entry usage to zswap_writeback_entry();
> we only use the copied swpentry on the stack to allocate the swap cache
> folio and return with it locked, after which we can reference the tree.
> Then we check for the invalidate race under the tree lock; what follows
> is much the same as zswap_load().
>
> Since we can't deref the entry after zswap_writeback_entry(), we
> can't use zswap_lru_putback() anymore; instead we rotate the entry

I added list_lru_putback to the list_lru API specifically for this use
case (zswap_lru_putback()). Now that we no longer need it, maybe we
can remove it as well (assuming no one else is using it?).

This can be done in a separate patch though.
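
Something along these lines in a follow-up, I think (untested sketch,
assuming zswap really was the only caller; this is just the declaration
that would go away, plus the matching definition in mm/list_lru.c):

	/* include/linux/list_lru.h */
	void list_lru_putback(struct list_lru *lru, struct list_head *item,
			      int nid, struct mem_cgroup *memcg);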

> in the LRU list so concurrent reclaimers have little chance to see it.
> It will be deleted from the LRU list if writeback succeeds.
>
> Another part that's confusing to me is the update of memcg nr_zswap_protected
> in zswap_lru_putback(). I'm not sure why it's needed there, since
> if we raced with swapin, memcg nr_zswap_protected has already been
> updated in zswap_folio_swapin(). So this part is not carried over for now.
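
For reference, the rotation-side bump that's being dropped is the one in
the removed hunk below:

	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
	/* increment the protection area to account for the LRU rotation. */
	atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);

and IIRC zswap_folio_swapin() does the equivalent atomic_long_inc() on
the folio's lruvec when the page is swapped back in, so dropping the
duplicate here makes sense to me.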
>
> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@xxxxxxxxxxxxxx/
>
> Signed-off-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>

LGTM! This is quite elegant.
Acked-by: Nhat Pham <nphamcs@xxxxxxxxx>

> ---
> mm/zswap.c | 93 ++++++++++++++++++------------------------------------
> 1 file changed, 31 insertions(+), 62 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 81cb3790e0dd..fa2bdb7ec1d8 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>  		 zpool_get_type((p)->zpools[0]))
> 
>  static int zswap_writeback_entry(struct zswap_entry *entry,
> -				 struct zswap_tree *tree);
> +				 swp_entry_t swpentry);
>  static int zswap_pool_get(struct zswap_pool *pool);
>  static void zswap_pool_put(struct zswap_pool *pool);
>
> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>  	rcu_read_unlock();
>  }
> 
> -static void zswap_lru_putback(struct list_lru *list_lru,
> -			      struct zswap_entry *entry)
> -{
> -	int nid = entry_to_nid(entry);
> -	spinlock_t *lock = &list_lru->node[nid].lock;
> -	struct mem_cgroup *memcg;
> -	struct lruvec *lruvec;
> -
> -	rcu_read_lock();
> -	memcg = mem_cgroup_from_entry(entry);
> -	spin_lock(lock);
> -	/* we cannot use list_lru_add here, because it increments node's lru count */
> -	list_lru_putback(list_lru, &entry->lru, nid, memcg);
> -	spin_unlock(lock);
> -
> -	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
> -	/* increment the protection area to account for the LRU rotation. */
> -	atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> -	rcu_read_unlock();
> -}
> -
>  /*********************************
>  * rbtree functions
>  **********************************/
> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>  {
>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>  	bool *encountered_page_in_swapcache = (bool *)arg;
> -	struct zswap_tree *tree;
> -	pgoff_t swpoffset;
> +	swp_entry_t swpentry;
>  	enum lru_status ret = LRU_REMOVED_RETRY;
>  	int writeback_result;
> 
> +	/*
> +	 * First rotate to the tail of lru list before unlocking lru lock,
> +	 * so the concurrent reclaimers have little chance to see it.
> +	 * It will be deleted from the lru list if writeback success.
> +	 */
> +	list_move_tail(item, &l->list);
> +
>  	/*
>  	 * Once the lru lock is dropped, the entry might get freed. The
> -	 * swpoffset is copied to the stack, and entry isn't deref'd again
> +	 * swpentry is copied to the stack, and entry isn't deref'd again
>  	 * until the entry is verified to still be alive in the tree.
>  	 */
> -	swpoffset = swp_offset(entry->swpentry);
> -	tree = swap_zswap_tree(entry->swpentry);
> -	list_lru_isolate(l, item);

nit: IIUC, now that we're no longer removing the entry from the
list_lru, we also protect against concurrent shrinking action via this
check inside zswap_writeback_entry(), right:

if (!folio_was_allocated) {
	folio_put(folio);
	return -EEXIST;
}

Maybe update the comment above it too?
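
e.g. something along these lines (wording is just a suggestion):

	/*
	 * Found an existing folio: we raced with swapin, or another
	 * reclaimer is already writing this entry back. Since the entry
	 * is now only rotated (not isolated) on the LRU, just skip it.
	 */
	if (!folio_was_allocated) {
		folio_put(folio);
		return -EEXIST;
	}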

> +	swpentry = entry->swpentry;
> +
>  	/*
>  	 * It's safe to drop the lock here because we return either
>  	 * LRU_REMOVED_RETRY or LRU_RETRY.
>  	 */
>  	spin_unlock(lock);
> 
> -	/* Check for invalidate() race */
> -	spin_lock(&tree->lock);
> -	if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
> -		goto unlock;
> -
> -	/* Hold a reference to prevent a free during writeback */
> -	zswap_entry_get(entry);
> -	spin_unlock(&tree->lock);
> -
> -	writeback_result = zswap_writeback_entry(entry, tree);
> +	writeback_result = zswap_writeback_entry(entry, swpentry);
> 
> -	spin_lock(&tree->lock);
>  	if (writeback_result) {
>  		zswap_reject_reclaim_fail++;
> -		zswap_lru_putback(&entry->pool->list_lru, entry);
>  		ret = LRU_RETRY;
> 
>  		/*
> @@ -903,27 +876,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>  		 */
>  		if (writeback_result == -EEXIST && encountered_page_in_swapcache)
>  			*encountered_page_in_swapcache = true;
> -
> -		goto put_unlock;
> +	} else {
> +		zswap_written_back_pages++;
>  	}
> -	zswap_written_back_pages++;
> -
> -	if (entry->objcg)
> -		count_objcg_event(entry->objcg, ZSWPWB);
> -
> -	count_vm_event(ZSWPWB);
> -	/*
> -	 * Writeback started successfully, the page now belongs to the
> -	 * swapcache. Drop the entry from zswap - unless invalidate already
> -	 * took it out while we had the tree->lock released for IO.
> -	 */
> -	zswap_invalidate_entry(tree, entry);
> 
> -put_unlock:
> -	/* Drop local reference */
> -	zswap_entry_put(entry);
> -unlock:
> -	spin_unlock(&tree->lock);
>  	spin_lock(lock);
>  	return ret;
>  }
> @@ -1408,9 +1364,9 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
>   * freed.
>   */
>  static int zswap_writeback_entry(struct zswap_entry *entry,
> -				 struct zswap_tree *tree)
> +				 swp_entry_t swpentry)
>  {
> -	swp_entry_t swpentry = entry->swpentry;
> +	struct zswap_tree *tree;
>  	struct folio *folio;
>  	struct mempolicy *mpol;
>  	bool folio_was_allocated;
> @@ -1442,18 +1398,31 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	 * backs (our zswap_entry reference doesn't prevent that), to
>  	 * avoid overwriting a new swap folio with old compressed data.
>  	 */
> +	tree = swap_zswap_tree(swpentry);
>  	spin_lock(&tree->lock);
> -	if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
> +	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
>  		spin_unlock(&tree->lock);
>  		delete_from_swap_cache(folio);
>  		folio_unlock(folio);
>  		folio_put(folio);
>  		return -ENOMEM;
>  	}
> +
> +	/* Safe to deref entry after the entry is verified above. */
> +	zswap_entry_get(entry);
>  	spin_unlock(&tree->lock);
> 
>  	__zswap_load(entry, &folio->page);
> 
> +	count_vm_event(ZSWPWB);
> +	if (entry->objcg)
> +		count_objcg_event(entry->objcg, ZSWPWB);
> +
> +	spin_lock(&tree->lock);
> +	zswap_invalidate_entry(tree, entry);
> +	zswap_entry_put(entry);
> +	spin_unlock(&tree->lock);
> +
>  	/* folio is up to date */
>  	folio_mark_uptodate(folio);
> 
> --
> 2.40.1
>
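
For anyone else reviewing: condensing the diff above, the ordering that
makes this safe, as I read it (paraphrased, not the literal code), is:

	/* shrink_memcg_cb(), still under the lru lock */
	list_move_tail(item, &l->list);		/* rotate; entry stays on the LRU */
	swpentry = entry->swpentry;		/* copy to the stack; last deref of entry */
	spin_unlock(lock);

	writeback_result = zswap_writeback_entry(entry, swpentry);

	/* zswap_writeback_entry(): allocate the swap cache folio first */
	if (!folio_was_allocated) {
		folio_put(folio);
		return -EEXIST;			/* raced with swapin or another reclaimer */
	}

	/* the locked swapcache folio is what holds off swapoff from here on */
	tree = swap_zswap_tree(swpentry);
	spin_lock(&tree->lock);
	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
		/* raced with invalidate; entry may already be freed */
		spin_unlock(&tree->lock);
		return -ENOMEM;
	}
	zswap_entry_get(entry);			/* only now is entry safe to deref again */
	spin_unlock(&tree->lock);

Looks right to me.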