Re: [PATCH v3 0/4] Swap-out small-sized THP without splitting

From: Barry Song
Date: Wed Nov 29 2023 - 02:48:14 EST


> Hi All,
>
> This is v3 of a series to add support for swapping out small-sized THP without
> needing to first split the large folio via __split_huge_page(). It closely
> follows the approach already used by PMD-sized THP.
>
> "Small-sized THP" is an upcoming feature that enables performance improvements
> by allocating large folios for anonymous memory, where the large folio size is
> smaller than the traditional PMD-size. See [3].
>
> In some circumstances I've observed a performance regression (see patch 2 for
> details), and this series is an attempt to fix the regression in advance of
> merging small-sized THP support.
>
> I've done what I thought was the smallest change possible, and as a result, this
> approach is only employed when the swap is backed by a non-rotating block device
> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
> that this is probably sufficient.
>
> The series applies against mm-unstable (1a3c85fa684a)
>
>
> Changes since v2 [2]
> ====================
>
> - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
>   allocation. This required some refactoring to make everything work nicely
>   (new patches 2 and 3).
> - Fix bug where nr_swap_pages would say there are pages available but the
>   scanner would not be able to allocate them because they were reserved for the
>   per-cpu allocator. We now allow stealing of order-0 entries from the high
>   order per-cpu clusters (in addition to existing stealing from order-0
>   per-cpu clusters).
>
> Thanks to Huang, Ying for the review feedback and suggestions!
>
>
> Changes since v1 [1]
> ====================
>
> - patch 1:
>     - Use cluster_set_count() instead of cluster_set_count_flag() in
>       swap_alloc_cluster() since we no longer have any flag to set. I was unable
>       to kill cluster_set_count_flag() as proposed against v1 as other call
>       sites depend on explicitly setting flags to 0.
> - patch 2:
>     - Moved large_next[] array into percpu_cluster to make it per-cpu
>       (recommended by Huang, Ying).
>     - large_next[] array is dynamically allocated because PMD_ORDER is not
>       compile-time constant for powerpc (fixes build error).
>
>
> Thanks,
> Ryan

> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
> but given Huang Ying had provided some review feedback, I wanted to progress it.
> All the actual prerequisites are either complete or being worked on by others.
>

Hi Ryan,

This is quite important for phones and is a must-have component; so is
large-folio swap-in, as I explained to you in another email.
Luckily, Chuanhua Han (Cc-ed) is preparing a patchset for large-folio swap-in
on top of this patchset, probably a port and cleanup of our do_swap_page() [1]
against yours.

Another concern is that swap slots can become fragmented if we place both small
and large folios in the same swap device. Since large folios always require
contiguous swap slots, allocation can fail even when we still have many free
slots, because those slots are not contiguous. To avoid this, our dynamic
hugepage solution [2] has two swap devices: one for base pages, the other for
CONTPTE. We have modified the priority-based selection of swap devices to choose
a device based on whether the folio is small or large.
I realize this approach is super ugly and that it might be very hard to find a
way to upstream it; it does not seem universal, especially if you are a Linux
server (-_-)
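
To illustrate the fragmentation concern above, here is a minimal userspace
sketch in C. It is a toy model, not kernel code: struct toy_swap_dev, NR_SLOTS
and all toy_*() helpers are hypothetical names invented for this example, and
it assumes a large folio needs a naturally aligned run of free slots, which is
a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NR_SLOTS 64	/* toy device size (hypothetical) */

/* Toy model of a swap device: one flag per swap slot. */
struct toy_swap_dev {
	bool used[NR_SLOTS];
	size_t nr_free;
};

static void toy_dev_init(struct toy_swap_dev *dev)
{
	memset(dev->used, 0, sizeof(dev->used));
	dev->nr_free = NR_SLOTS;
}

/*
 * Allocate nr contiguous, naturally aligned slots (nr is a power of two).
 * Returns the first slot index, or -1 if no such free run exists.
 */
static long toy_alloc(struct toy_swap_dev *dev, size_t nr)
{
	assert(nr > 0 && (nr & (nr - 1)) == 0 && nr <= NR_SLOTS);

	for (size_t base = 0; base + nr <= NR_SLOTS; base += nr) {
		size_t i;

		for (i = 0; i < nr && !dev->used[base + i]; i++)
			;
		if (i < nr)
			continue;	/* at least one slot in this run is busy */

		for (i = 0; i < nr; i++)
			dev->used[base + i] = true;
		dev->nr_free -= nr;
		return (long)base;
	}
	return -1;
}

static void toy_free(struct toy_swap_dev *dev, size_t slot)
{
	assert(slot < NR_SLOTS && dev->used[slot]);
	dev->used[slot] = false;
	dev->nr_free++;
}

/*
 * Fill the device with order-0 entries, then free every other slot.
 * Half of the device is free afterwards, but no two free slots are adjacent.
 */
static void toy_fragment(struct toy_swap_dev *dev)
{
	toy_dev_init(dev);
	for (size_t i = 0; i < NR_SLOTS; i++)
		toy_alloc(dev, 1);
	for (size_t i = 0; i < NR_SLOTS; i += 2)
		toy_free(dev, i);
}

/* Two-device variant: choose the device by folio size, not by priority. */
static struct toy_swap_dev *toy_pick_dev(struct toy_swap_dev *small_dev,
					 struct toy_swap_dev *large_dev,
					 size_t nr)
{
	return nr == 1 ? small_dev : large_dev;
}
```

After toy_fragment(), the device reports NR_SLOTS / 2 free slots, yet
toy_alloc(dev, 16) returns -1 while toy_alloc(dev, 1) still succeeds. With two
devices selected through toy_pick_dev(), order-0 traffic cannot break up the
runs on the large-folio device, which is the only point of the split.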

Two devices are not a nice approach, though it works well for a real product.
We might still need some decent way to address this problem, although it is
certainly not a blocker for your patchset.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
[2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129

>
> [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@xxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@xxxxxxx/
> [3] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@xxxxxxx/
>
>
> Ryan Roberts (4):
>   mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
>   mm: swap: Remove struct percpu_cluster
>   mm: swap: Simplify ssd behavior when scanner steals entry
>   mm: swap: Swap-out small-sized THP without splitting
>
>  include/linux/swap.h |  31 +++---
>  mm/huge_memory.c     |   3 -
>  mm/swapfile.c        | 232 ++++++++++++++++++++++++-------------------
>  mm/vmscan.c          |  10 +-
>  4 files changed, 149 insertions(+), 127 deletions(-)

Thanks
Barry