Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags

From: Ryan Roberts
Date: Tue Mar 05 2024 - 03:46:50 EST

Next message: Ard Biesheuvel: "Re: [PATCH v7 0/9] x86: Confine early 1:1 mapped startup code"
Previous message: Shenghao Ding: "[PATCH v10] The tas2783 is a smart audio amplifier with integrated MIPI SoundWire interface (Version 1.2.1 compliant), I2C, and I2S/TDM interfaces designed for portable applications. An on-chip DSP supports Texas Instruments SmartAmp speaker protection algorithm. The integrated speaker voltage and current sense provides for real-time monitoring of loudspeakers."
In reply to: David Hildenbrand: "Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 05/03/2024 08:35, David Hildenbrand wrote:
> On 05.03.24 07:11, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>
>>> + Hugh
>>>
>>> On 04/03/2024 22:02, David Hildenbrand wrote:
>>>> On 04.03.24 22:55, Ryan Roberts wrote:
>>>>> On 04/03/2024 20:50, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> This is the existing free_swap_and_cache(). I think _swap_info_get() would
>>>>>>>>> break
>>>>>>>>> if this could race with swapoff(), and __swap_entry_free() looks up the
>>>>>>>>> cluster
>>>>>>>>> from an array, which would also be freed by swapoff if racing:
>>>>>>>>>
>>>>>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>>>>>> {
>>>>>>>>>        struct swap_info_struct *p;
>>>>>>>>>        unsigned char count;
>>>>>>>>>
>>>>>>>>>        if (non_swap_entry(entry))
>>>>>>>>>            return 1;
>>>>>>>>>
>>>>>>>>>        p = _swap_info_get(entry);
>>>>>>>>>        if (p) {
>>>>>>>>>            count = __swap_entry_free(p, entry);
>>>>>>>>
>>>>>>>> If count dropped to 0 and
>>>>>>>>
>>>>>>>>>            if (count == SWAP_HAS_CACHE)
>>>>>>>>
>>>>>>>>
>>>>>>>> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We
>>>>>>>> removed
>>>>>>>> it. That one would have to be reclaimed asynchronously.
>>>>>>>>
>>>>>>>> The existing code we would call swap_page_trans_huge_swapped() with the
>>>>>>>> SI it
>>>>>>>> obtained via _swap_info_get().
>>>>>>>>
>>>>>>>> I also don't see what should be left protecting the SI. It's not locked
>>>>>>>> anymore,
>>>>>>>> the swapcounts are at 0. We don't hold the folio lock.
>>>>>>>>
>>>>>>>> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...
>>>>>>>
>>>>>>> But, assuming the caller of free_swap_and_cache() acquires the PTL first, I
>>>>>>> think this all works out ok? While free_swap_and_cache() is running,
>>>>>>> try_to_unuse() will wait for the PTL. Or if try_to_unuse() runs first, then
>>>>>>> free_swap_and_cache() will never be called because the swap entry will have
>>>>>>> been
>>>>>>> removed from the PTE?
>>>>>>
>>>>>> But can't try_to_unuse() run, detect !si->inuse_pages and not even bother
>>>>>> about
>>>>>> scanning any further page tables?
>>>>>>
>>>>>> But my head hurts from digging through that code.
>>>>>
>>>>> Yep, glad I'm not the only one that gets headaches from swapfile.c.
>>>>>
>>>>>>
>>>>>> Let me try again:
>>>>>>
>>>>>> __swap_entry_free() might be the last user and result in "count ==
>>>>>> SWAP_HAS_CACHE".
>>>>>>
>>>>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>>>>
>>>>>>
>>>>>> So the question is: could someone reclaim the folio and turn
>>>>>> si->inuse_pages==0,
>>>>>> before we completed swap_page_trans_huge_swapped().
>>>>>>
>>>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>>>>> still
>>>>>> references by swap entries.
>>>>>>
>>>>>> Process 1 still references subpage 0 via swap entry.
>>>>>> Process 2 still references subpage 1 via swap entry.
>>>>>>
>>>>>> Process 1 quits. Calls free_swap_and_cache().
>>>>>> -> count == SWAP_HAS_CACHE
>>>>>> [then, preempted in the hypervisor etc.]
>>>>>>
>>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>
>>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>>> __try_to_reclaim_swap().
>>>>>>
>>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->put_swap_folio()->
>>>>>> free_swap_slot()->swapcache_free_entries()->swap_entry_free()->swap_range_free()->
>>>>>> ...
>>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>>
>>>>>>
>>>>>> What stops swapoff to succeed after process 2 reclaimed the swap cache but
>>>>>> before process 1 finished its call to swap_page_trans_huge_swapped()?
>>>>>
>>>>> Assuming you are talking about anonymous memory, process 1 has the PTL while
>>>>> it's executing free_swap_and_cache(). try_to_unuse() iterates over every
>>>>> vma in
>>>>> every mm, and it swaps-in a page for every PTE that holds a swap entry for the
>>>>> device being swapoff'ed. It takes the PTL while converting the swap entry to
>>>>> present PTE - see unuse_pte(). Process 1 must have beaten try_to_unuse() to
>>>>> the
>>>>> particular pte, because if try_to_unuse() got there first, it would have
>>>>> converted it from a swap entry to present pte and process 1 would never even
>>>>> have called free_swap_and_cache(). So try_to_unuse() will eventually wait
>>>>> on the
>>>>> PTL until process 1 has released it after free_swap_and_cache() completes.
>>>>> Am I
>>>>> missing something? Because that part feels pretty clear to me.
>>>>
>>>> Why should try_to_unuse() do *anything* if it already finds
>>>> si->inuse_pages == 0 because we (p1 } p2) just freed the swapentries and
>>>> process
>>>> 2 managed to free the last remaining swapcache entry?
>>>
>>> Yeah ok. For some reason I thought unuse_mm() was iterating over all mms and so
>>> the `while (READ_ONCE(si->inuse_pages))` was only evaluated after iterating over
>>> every mm. Oops.
>>>
>>> So yes, I agree with you; I think this is broken. And I'm a bit worried this
>>> could be a can of worms; By the same logic, I think folio_free_swap(),
>>> swp_swapcount() and probably others are broken in the same way.
>>
>> Don't worry too much :-), we have get_swap_device() at least. We can
>> insert it anywhere we want because it's quite lightweight. And, because
>> swapoff() is so rare, the race is theoretical only.

Thanks for the response!

>>
>> For this specific case, I had thought that PTL is enough. But after
>> looking at this more, I found a race here too. Until
>> __swap_entry_free() return, we are OK, nobody can reduce the swap count
>> because we held the PTL.

Even that is not true for the shmem case: As far as I can see, shmem doesn't
have the PTL or any other synchronizing lock when it calls
free_swap_and_cache(). I don't think that changes anything solution-wise though.

>> But, after that, even if its return value is
>> SWAP_HAS_CACHE (that is, in swap cache), parallel swap_unuse() or
>> __try_to_reclaim_swap() may remove the folio from swap cache, so free
>> the swap entry. So, swapoff() can proceed to free the data structures
>> in parallel.
>>
>> To fix the race, we can add get/put_swap_device() in
>> free_swap_and_cache().
>>
>> For other places, we can check whether get/put_swap_device() has been
>> called in callers, and the swap reference we held has been decreased
>> (e.g., swap count protected by PTL, SWAP_HAS_CACHE protected by folio
>> lock).
>
> Yes, sounds reasonable. We should likely update the documentation of
> get_swap_device(), that after decrementing the refcount, the SI might become
> stale and should not be touched without a prior get_swap_device().

Yep agreed. If nobody else is planning to do it, I'll try to create a test case
that provokes the problem then put a patch together to fix it.