Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization

From: Kairui Song
Date: Wed Mar 27 2024 - 07:05:26 EST


On Wed, Mar 27, 2024 at 4:27 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> [...]
>
> >>> Test 1, sequential swapin/out of 30G zero page on ZRAM:
> >>>
> >>>                 Before (us)   After (us)
> >>> Swapout:        33619409      33886008
> >>> Swapin:         32393771      32465441   (- 0.2%)
> >>> Swapout (THP):  7817909       6899938    (+11.8%)
> >>> Swapin (THP):   32452387      33193479   (- 2.2%)
> >>
> >> If my understanding is correct, we don't have swapin (THP) support
> >> yet. Right?
> >
> > Yes, this series doesn't change how swapin/swapout works with THP in
> > general, but THP swapout will now leave shadows with a large order, so
> > they need to be split upon swapin. That will slow down later swapins
> > a little, but I think it's worth it.
> >
> > If we can do THP swapin in the future, this split on swapin can be
> > avoided, making the performance even better.
>
> I'm confused by this (clearly my understanding of how this works is incorrect).
> Perhaps you can help me understand:
>
> When you talk about "shadows" I assume you are referring to the swap cache? It
> was my understanding that swapping out a THP would always leave the large folio
> in the swap cache, so this is nothing new?
>
> And on swap-in, if the target page is in the swap cache, even if part of a large
> folio, why does it need to be split? I assumed the single page would just be
> mapped? (and if all the other pages subsequently fault, then you end up with a
> fully mapped large folio back in the process)?
>
> Perhaps I'm misunderstanding what "shadows" are?

Hi Ryan

My bad, I didn't make this clear.

Ying has posted the link to the commit that added "shadow" support
for anon pages; it has become a very important part of LRU activation
/ workingset tracking. Basically, when folios are removed from the
cache xarray (e.g. after swap writeback is done), instead of releasing
the xarray slot, an unsigned long / void * is stored in it, recording
some info that will be used when a refault happens to decide how to
handle the folio from the LRU / workingset side.

And about large folios in the swap cache: if you look at the current
version of add_to_swap_cache in mainline (it adds a folio of any order
into the swap cache), it calls xas_create_range(&xas) and then fills
every xarray slot in the entire range covered by the folio. But xarray
supports multi-index storing, making use of the nature of the radix
tree to save a lot of slots. E.g. for a 2M THP, previously 8 + 512
slots (8 extra xa nodes) were needed to store it; after this series it
only needs 8 slots by using a multi-index store (not sure if I did the
math right).

The same goes for shadows: when a folio is being deleted,
__delete_from_swap_cache currently walks the xarray with xas_next and
updates all 8 + 512 slots one by one; after this series, only 8 stores
are needed (ignoring fragmentation).

And regarding swapin, I was talking about swapping in one sub-page of
a THP folio after the folio itself is gone, leaving a few multi-index
shadow slots behind. Those multi-index slots need to be split (a
multi-index slot has to be updated as a whole or split first;
__filemap_add_folio handles such splits). In this series I optimized
and reused that routine from __filemap_add_folio, so without too much
work it works perfectly for the swap cache too.