Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags

From: David Hildenbrand
Date: Fri Mar 01 2024 - 12:19:14 EST


On 01.03.24 18:14, Ryan Roberts wrote:
> On 01/03/2024 17:00, David Hildenbrand wrote:
>> On 01.03.24 17:44, Ryan Roberts wrote:
>>> On 01/03/2024 16:31, Matthew Wilcox wrote:
>>>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>>>> I've implemented the batching as David suggested, and I'm pretty
>>>>> confident it's correct. The only problem is that during testing I can't
>>>>> provoke the code to take the path. I've been poring through the code but
>>>>> struggling to figure out under what situation you would expect the swap
>>>>> entry passed to free_swap_and_cache() to still have a cached folio? Does
>>>>> anyone have any idea?
>>>>>
>>>>> This is the original (unbatched) function, after my change, which caused
>>>>> David's concern that we would end up calling __try_to_reclaim_swap() far
>>>>> too much:
>>>>>
>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>> {
>>>>>     struct swap_info_struct *p;
>>>>>     unsigned char count;
>>>>>
>>>>>     if (non_swap_entry(entry))
>>>>>         return 1;
>>>>>
>>>>>     p = _swap_info_get(entry);
>>>>>     if (p) {
>>>>>         count = __swap_entry_free(p, entry);
>>>>>         if (count == SWAP_HAS_CACHE)
>>>>>             __try_to_reclaim_swap(p, swp_offset(entry),
>>>>>                           TTRS_UNMAPPED | TTRS_FULL);
>>>>>     }
>>>>>     return p != NULL;
>>>>> }
>>>>>
>>>>> The trouble is, whenever it's called, count is always 0, so
>>>>> __try_to_reclaim_swap() never gets called.
>>>>>
>>>>> My test case is allocating 1G of anon memory, then doing
>>>>> madvise(MADV_PAGEOUT) over it, then doing either munmap() or
>>>>> madvise(MADV_FREE), both of which cause this function to be called for
>>>>> every PTE. But count is always 0 after __swap_entry_free(), so
>>>>> __try_to_reclaim_swap() is never called. I've tried for order-0 as well
>>>>> as PTE- and PMD-mapped 2M THP.

>>>> I think you have to page it back in again, then it will have an entry in
>>>> the swap cache.  Maybe.  I know little about anon memory ;-)

>>> Ahh, I was under the impression that the original folio is put into the
>>> swap cache at swap-out, then (I guess) it's removed once the IO is
>>> complete? I'm sure I'm miles out... what exactly is the lifecycle of a
>>> folio going through swap-out?

>> I thought with most (disk) backends you will add it to the swap cache and
>> leave it there until there is actual memory pressure. Only then, under
>> memory pressure, would you actually reclaim the folio.

> OK, my problem is that I'm using a VM whose disk shows up as rotating
> media, so the swap subsystem refuses to swap out THPs to it and they get
> split. To solve that (and to speed up testing) I moved to the block ram
> disk, which convinces swap to swap out THPs. But that causes the folios to
> be removed from the swap cache (I assumed because it's synchronous, but
> maybe there is a flag somewhere to affect that behavior?). And I can't
> convince QEMU to emulate an SSD for the guest under macOS. Perhaps the
> easiest thing is to hack it to ignore the rotating media flag.
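One possible alternative to hacking the kernel (a sketch, not verified in this thread): the swap code samples the queue's rotational flag at swapon time, and that flag is writable from userspace via sysfs. Device names below are assumptions; adjust to the actual guest disk and swap partition (this is a config fragment, so no test is attached):

```shell
# Mark the guest disk non-rotational before swapon, so the kernel treats
# it as an SSD and allows THP swap-out without splitting.
# /dev/vda and /dev/vda2 are assumed names; substitute the real devices.
echo 0 > /sys/block/vda/queue/rotational
swapon /dev/vda2    # must run after the flag change to pick it up
```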

I'm trying to remember how I triggered it in the past; I thought the cow.c
selftest was able to do that.

What certainly works is taking a reference on the page using vmsplice() and
then doing the MADV_PAGEOUT. But there has to be a better way :)

I'll dig on Monday!

--
Cheers,

David / dhildenb