Re: [PATCH v1] mm: swap: Fix race between free_swap_and_cache() and swapoff()

From: Ryan Roberts
Date: Wed Mar 06 2024 - 04:12:10 EST


On 06/03/2024 02:52, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>
>> There was previously a theoretical window where swapoff() could run and
>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>> running in another thread. This could cause, amongst other bad
>> possibilities, swap_page_trans_huge_swapped() (called by
>> free_swap_and_cache()) to access the freed memory for swap_map.
>>
>> This is a theoretical problem and I haven't been able to provoke it from
>> a test case. But there has been agreement based on code review that this
>> is possible (see link below).
>>
>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>> the swap entry was valid. This wasn't present in get_swap_device() so
>> I've added it. I couldn't find any existing get_swap_device() call sites
>> where this extra check would cause any false alarms.
>>
>> Details of how to provoke one possible issue (thanks to David Hilenbrand
>> for deriving this):
>>
>> --8<-----
>>
>> __swap_entry_free() might be the last user and result in
>> "count == SWAP_HAS_CACHE".
>>
>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>
>> So the question is: could someone reclaim the folio and turn
>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
>>
>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>> still references by swap entries.
>>
>> Process 1 still references subpage 0 via swap entry.
>> Process 2 still references subpage 1 via swap entry.
>>
>> Process 1 quits. Calls free_swap_and_cache().
>> -> count == SWAP_HAS_CACHE
>> [then, preempted in the hypervisor etc.]
>>
>> Process 2 quits. Calls free_swap_and_cache().
>> -> count == SWAP_HAS_CACHE
>>
>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>> __try_to_reclaim_swap().
>>
>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>> swap_entry_free()->swap_range_free()->
>> ...
>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>
>> What stops swapoff to succeed after process 2 reclaimed the swap cache
>> but before process1 finished its call to swap_page_trans_huge_swapped()?
>>
>> --8<-----
>
> I think that this can be simplified. Even for a 4K folio, this could
> happen.

I'm not so sure...

>
> CPU0 CPU1
> ---- ----
>
> zap_pte_range
> free_swap_and_cache
> __swap_entry_free
> /* swap count become 0 */
> swapoff
> try_to_unuse
> filemap_get_folio
> folio_free_swap
> /* remove swap cache */
> /* free si->swap_map[] */
>
> swap_page_trans_huge_swapped <-- access freed si->swap_map !!!

I don't think si->inuse_pages is decremented until __try_to_reclaim_swap() is
called (per David, above), which is called after swap_page_trans_huge_swapped()
has executed. So in CPU1, try_to_unuse() wouldn't see si->inuse_pages being zero
until after CPU0 has completed accessing si->swap_map, so if swapoff starts
where you have put it, it would get stalled waiting for the PTL which CPU0 has.

I'm sure there are other ways that this could be racy for 4K folios, but I don't
think this is one of them? e.g. if CPU0 does something with si after it has
decremented si->inuse_pages, then there is a problem.

>
>> Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
>> Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@xxxxxxxxxx/
>> Cc: stable@xxxxxxxxxxxxxxx
>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>> ---
>>
>> Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).
>>
>> Thanks,
>> Ryan
>>
>> mm/swapfile.c | 14 +++++++++++---
>> 1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 2b3a2d85e350..f580e6abc674 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> smp_rmb();
>> offset = swp_offset(entry);
>> if (offset >= si->max)
>> - goto put_out;
>> + goto bad_offset;
>> + if (data_race(!si->swap_map[swp_offset(entry)]))
>> + goto bad_free;
>>
>> return si;
>> bad_nofile:
>> @@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> out:
>> return NULL;
>> put_out:
>> - pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
>> percpu_ref_put(&si->users);
>> return NULL;
>> +bad_offset:
>> + pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
>> + goto put_out;
>> +bad_free:
>> + pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
>> + goto put_out;
>> }
>
> I don't think that it's a good idea to warn for bad free entries.
> get_swap_device() could be called without enough lock to prevent
> parallel swap operations on entry. So, false positive is possible
> there. I think that it's good to add more checks in general, for
> example, in free_swap_and_cache(), we can check more because we are sure
> the swap entry will not be freed by parallel swap operations.

Yes, agreed. Johannes also reported that he is seeing false alarms due to this.
I'm going to remove it and add it to free_swap_and_cache() as you suggest. This
also fits well for the batched version I'm working on where we want to check the
global si things once, but need to check !free for every entry in the loop
(aiming to post that this week).

Just
> don't put the check in general get_swap_device(). We can add another
> helper to check that.
>
> I found that there are other checks in get_swap_device() already. I
> think that we may need to revert,
>
> commit 23b230ba8ac3 ("mm/swap: print bad swap offset entry in get_swap_device")

Yes agree this should be reverted.

>
> which introduces it. And add check in appropriate places.

I'm not quite sure what the "appropriate places" are. Looking at the call sites
for get_swap_device(), it looks like they are all racy except
free_swap_and_cache() which is called with the PTL. So check should only really
go there?

But... free_swap_and_cache() is called without the PTL by shmem (in 2 places -
see shmem_free_swap() wrapper). It looks like in one of those places, the folio
lock is held, so I guess this has a similar effect to holding the PTL. But the
other shmem_free_swap() call site doesn't appear to have the folio lock. Am I
missing something, or does this mean that for this path, free_swap_and_cache()
is racy and therefore we shouldn't be doing either the `offset >= si->max` or
the `!swap_map[offset]` in free_swap_and_cache() either?

>
> --
> Best Regards,
> Huang, Ying
>
>> static unsigned char __swap_entry_free(struct swap_info_struct *p,
>> @@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry)
>> if (non_swap_entry(entry))
>> return 1;
>>
>> - p = _swap_info_get(entry);
>> + p = get_swap_device(entry);
>> if (p) {
>> count = __swap_entry_free(p, entry);
>> if (count == SWAP_HAS_CACHE &&
>> !swap_page_trans_huge_swapped(p, entry))
>> __try_to_reclaim_swap(p, swp_offset(entry),
>> TTRS_UNMAPPED | TTRS_FULL);
>> + put_swap_device(p);
>> }
>> return p != NULL;
>> }
>> --
>> 2.25.1