Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole

From: Barry Song
Date: Fri Mar 15 2024 - 06:02:06 EST


On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Barry Song <21cnbao@xxxxxxxxx> writes:
>
> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Barry Song <21cnbao@xxxxxxxxx> writes:
> >>
> >> > From: Chuanhua Han <hanchuanhua@xxxxxxxx>
> >> >
> >> > On an embedded system like Android, more than half of anon memory is
> >> > actually in swap devices such as zRAM. For example, while an app is
> >> > switched to background, its most memory might be swapped-out.
> >> >
> >> > Now we have mTHP features, unfortunately, if we don't support large folios
> >> > swap-in, once those large folios are swapped-out, we immediately lose the
> >> > performance gain we can get through large folios and hardware optimization
> >> > such as CONT-PTE.
> >> >
> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >> > to those contiguous swaps which were likely swapped out from mTHP as a
> >> > whole.
> >> >
> >> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> >> > case. It doesn't support swapin_readahead as large folios yet since this
> >> > kind of shared memory is much less than memory mapped by single process.
> >>
> >> In contrast, I still think that it's better to start with normal swap-in
> >> path, then expand to SWAP_SYCHRONOUS case.
> >
> > I'd rather try the reverse direction as non-sync anon memory is only around
> > 3% in a phone as my observation.
>
> Phone is not the only platform that Linux is running on.

I suppose it's generally true that forked, shared anonymous pages constitute
only a small portion of all anonymous pages; the majority of anonymous pages
belong to a single process.

I agree phones are not the only platform, but Rome wasn't built in a day. I
can only get started on hardware that I can easily reach and for which I have
enough hardware/test resources. So we can take a first step that applies to a
real product and improves its performance, and then, step by step, broaden it
and make it widely useful in areas I can't reach myself :-)

So perhaps we can start with a sysfs "enable" entry that defaults to "n", or
with a maximum swap-in order, as Ryan suggested at the beginning (a rough
sketch of such a cap follows the quote below):
"
So in the common case, swap-in will pull in the same size of folio as was
swapped-out. Is that definitely the right policy for all folio sizes? Certainly
it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
it makes sense for 2M THP; As the size increases the chances of actually needing
all of the folio reduces so chances are we are wasting IO. There are similar
arguments for CoW, where we currently copy 1 page per fault - it probably makes
sense to copy the whole folio up to a certain size.
"

>
> >>
> >> In normal swap-in path, we can take advantage of swap readahead
> >> information to determine the swapped-in large folio order. That is, if
> >> the return value of swapin_nr_pages() > 1, then we can try to allocate
> >> and swapin a large folio.
> >
> > I am not quite sure we still need to depend on this. in do_anon_page,
> > we have broken the assumption and allocated a large folio directly.
>
> I don't think that we have a sophisticated policy to allocate large
> folio. Large folio could waste memory for some workloads, so I don't
> think that it's a good idea to allocate large folio always.

I agree, but we still have the same checks that do_anonymous_page() has:

orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
				  BIT(PMD_ORDER) - 1);
orders = thp_vma_suitable_orders(vma, vmf->address, orders);

In do_anonymous_page() we don't worry about the waste so much, and the same
logic also applies to do_swap_page() (see the sketch below).
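As a rough illustration (not the exact patch code), do_swap_page() could then
walk the allowed orders much like alloc_anon_folio() does, falling back to a
smaller order when a larger one doesn't fit; swap_entries_look_contiguous() is
a hypothetical placeholder for the actual test:

/*
 * Sketch only, loosely mirroring alloc_anon_folio(): try the highest
 * allowed order first and fall back to smaller ones.
 */
order = highest_order(orders);
while (orders) {
	addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
	if (swap_entries_look_contiguous(vmf, addr, 1 << order))
		break;
	order = next_order(&orders, order);
}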

>
> Readahead gives us an opportunity to play with the policy.

I feel the rules of the game change somewhat once there is an upper limit on
the swap-in size. For example, an upper limit of order 4 caps the folio size
at 64KiB, which is still a reasonable size for ARM64, whose base page size can
be 64KiB.

On the other hand, while swapping out large folios, we will always compress
them as a whole (a zsmalloc/zram patch will come in a couple of days). If we
choose to decompress a subpage rather than the whole large folio in
do_swap_page(), we might need to decompress up to nr_pages times. For example,
a 16*4KiB large folio is saved as one large object in zsmalloc (with the
coming patch); if we swap in a small folio, we decompress the whole large
object for a single subpage, and the next fault has to decompress the same
large object all over again. So it is more sensible to swap in a large folio
if we find that the swap entries are contiguous and were allocated by a
large-folio swap-out (a rough check is sketched below).
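A minimal sketch of what such a contiguity check might look like; this is an
illustration built on existing swapops helpers, not the helper the patch
actually adds:

/*
 * Sketch only: return true if the @nr swap PTEs starting at @pte point
 * to contiguous offsets on the same swap device, which strongly hints
 * that they were written out together as one large folio.
 */
static bool swap_pte_range_contiguous(pte_t *pte, int nr)
{
	swp_entry_t first = pte_to_swp_entry(ptep_get(pte));
	unsigned long offset = swp_offset(first);
	unsigned int type = swp_type(first);
	int i;

	for (i = 1; i < nr; i++) {
		pte_t ptent = ptep_get(pte + i);
		swp_entry_t entry;

		if (!is_swap_pte(ptent))
			return false;
		entry = pte_to_swp_entry(ptent);
		if (swp_type(entry) != type ||
		    swp_offset(entry) != offset + i)
			return false;
	}
	return true;
}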

>
> > On the other hand, compressing/decompressing large folios as a
> > whole rather than doing it one by one can save a large percent of
> > CPUs and provide a much lower compression ratio. With a hardware
> > accelerator, this is even faster.
>
> I am not against to support large folio for compressing/decompressing.
>
> I just suggest to do that later, after we play with normal swap-in.
> SWAP_SYCHRONOUS related swap-in code is an optimization based on normal
> swap. So, it seems natural to support large folio swap-in for normal
> swap-in firstly.

I feel the SWAP_SYNCHRONOUS case is simpler and arguably even more "normal"
than the swapcache path, since it covers the majority of swap-ins. On the
other hand, the swapcache path requires a lot of modification. In OPPO's
code [1], we brought up both paths, but the swapcache path is much more
complicated than the SYNC path and showed no really noticeable improvement.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplus/sm8650_u_14.0.0_oneplus12

>
> > So I'd rather more aggressively get large folios swap-in involved
> > than depending on readahead.
>
> We can take advantage of readahead algorithm in SWAP_SYCHRONOUS
> optimization too. The sub-pages that is not accessed by page fault can
> be treated as readahead. I think that is a better policy than
> allocating large folio always.

Considering the zsmalloc optimization, it is the better choice to always
allocate large folios if we find the swap entries belong to a swapped-out
large folio: by decompressing just once, we get all the subpages. Some
hardware accelerators can even decompress a large folio with multiple hardware
threads; for example, 16 hardware threads can decompress the subpages of a
large folio in parallel, which is just as fast as decompressing a single
subpage.

For platforms without the above optimizations, a proper upper limit will let
them disable large-folio swap-in entirely or reduce its impact. For example,
with an upper limit of order 0 we effectively drop this patchset, and with an
upper limit of order 2 it behaves as if the base page size were 16KiB.

>
> >>
> >> To do that, we need to track whether the sub-pages are accessed. I
> >> guess we need that information for large file folio readahead too.
> >>
> >> Hi, Matthew,
> >>
> >> Can you help us on tracking whether the sub-pages of a readahead large
> >> folio has been accessed?
> >>
> >> > Right now, we are re-faulting large folios which are still in swapcache as a
> >> > whole, this can effectively decrease extra loops and early-exitings which we
> >> > have increased in arch_swap_restore() while supporting MTE restore for folios
> >> > rather than page. On the other hand, it can also decrease do_swap_page as
> >> > PTEs used to be set one by one even we hit a large folio in swapcache.
> >> >
> >>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry