Re: [PATCH v2 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free

From: Barry Song
Date: Thu Mar 07 2024 - 02:01:00 EST


On Thu, Mar 7, 2024 at 7:15 PM Lance Yang <ioworker0@xxxxxxxxx> wrote:
>
> This patch optimizes lazyfreeing with PTE-mapped mTHP[1]
> (Inspired by David Hildenbrand[2]). We aim to avoid unnecessary
> folio splitting if the large folio is entirely within the given
> range.
>
> On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by
> PTE-mapped folios of the same size results in the following
> runtimes for madvise(MADV_FREE) in seconds (shorter is better):
>
> Folio Size | Old | New | Change
> ------------------------------------------
> 4KiB | 0.590251 | 0.590259 | 0%
> 16KiB | 2.990447 | 0.185655 | -94%
> 32KiB | 2.547831 | 0.104870 | -95%
> 64KiB | 2.457796 | 0.052812 | -97%
> 128KiB | 2.281034 | 0.032777 | -99%
> 256KiB | 2.230387 | 0.017496 | -99%
> 512KiB | 2.189106 | 0.010781 | -99%
> 1024KiB | 2.183949 | 0.007753 | -99%
> 2048KiB | 0.002799 | 0.002804 | 0%
>
> [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@xxxxxxx
> [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@xxxxxxxxxx/
>
> Signed-off-by: Lance Yang <ioworker0@xxxxxxxxx>
> ---
> v1 -> v2:
> * Update the performance numbers
> * Update the changelog, suggested by Ryan Roberts
> * Check the COW folio, suggested by Yin Fengwei
> * Check if we are mapping all subpages, suggested by Barry Song,
> David Hildenbrand, Ryan Roberts
> * https://lore.kernel.org/linux-mm/20240225123215.86503-1-ioworker0@xxxxxxxxx/
>
> mm/madvise.c | 85 +++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 74 insertions(+), 11 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 44a498c94158..1437ac6eb25e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -616,6 +616,20 @@ static long madvise_pageout(struct vm_area_struct *vma,
> return 0;
> }
>
> +static inline bool can_mark_large_folio_lazyfree(unsigned long addr,
> + struct folio *folio, pte_t *start_pte)
> +{
> + int nr_pages = folio_nr_pages(folio);
> + fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +
> + for (int i = 0; i < nr_pages; i++)
> + if (page_mapcount(folio_page(folio, i)) != 1)
> + return false;

We have moved to folio_estimated_sharers(); although it is not precise,
it avoids doing this check with a loop that depends on every subpage's
mapcount.
BTW, do we need to rebase our work against David's changes[1]?
[1] https://lore.kernel.org/linux-mm/20240227201548.857831-1-david@xxxxxxxxxx/

> +
> + return nr_pages == folio_pte_batch(folio, addr, start_pte,
> + ptep_get(start_pte), nr_pages, flags, NULL);
> +}
> +
> static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> unsigned long end, struct mm_walk *walk)
>
> @@ -676,11 +690,45 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> */
> if (folio_test_large(folio)) {
> int err;
> + unsigned long next_addr, align;
>
> - if (folio_estimated_sharers(folio) != 1)
> - break;
> - if (!folio_trylock(folio))
> - break;
> + if (folio_estimated_sharers(folio) != 1 ||
> + !folio_trylock(folio))
> + goto skip_large_folio;


I don't think we can skip all the PTEs for nr_pages, as some of them
might be pointing to other folios.

For example, for a large folio mapped by 16 PTEs, if you do
MADV_DONTNEED on pages 15-16 and then write to that memory, you get
page faults, so PTE15 and PTE16 will point to two different small
folios. We can only skip when we are sure that
nr_pages == folio_pte_batch().

> +
> + align = folio_nr_pages(folio) * PAGE_SIZE;
> + next_addr = ALIGN_DOWN(addr + align, align);
> +
> + /*
> + * If we mark only the subpages as lazyfree, or
> + * cannot mark the entire large folio as lazyfree,
> + * then just split it.
> + */
> + if (next_addr > end || next_addr - addr != align ||
> + !can_mark_large_folio_lazyfree(addr, folio, pte))
> + goto split_large_folio;
> +
> + /*
> + * Avoid unnecessary folio splitting if the large
> + * folio is entirely within the given range.
> + */
> + folio_clear_dirty(folio);
> + folio_unlock(folio);
> + for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> + ptent = ptep_get(pte);
> + if (pte_young(ptent) || pte_dirty(ptent)) {
> + ptent = ptep_get_and_clear_full(
> + mm, addr, pte, tlb->fullmm);
> + ptent = pte_mkold(ptent);
> + ptent = pte_mkclean(ptent);
> + set_pte_at(mm, addr, pte, ptent);
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + }

Can we do this in batches? For a CONT-PTE mapped large folio, you are
unfolding and refolding the contiguous mapping on every PTE; that seems
quite expensive.

> + }
> + folio_mark_lazyfree(folio);
> + goto next_folio;
> +
> +split_large_folio:
> folio_get(folio);
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(start_pte, ptl);
> @@ -688,13 +736,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> err = split_folio(folio);
> folio_unlock(folio);
> folio_put(folio);
> - if (err)
> - break;
> - start_pte = pte =
> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> - if (!start_pte)
> - break;
> - arch_enter_lazy_mmu_mode();
> +
> + /*
> + * If the large folio is locked or cannot be split,
> + * we just skip it.
> + */
> + if (err) {
> +skip_large_folio:
> + if (next_addr >= end)
> + break;
> + pte += (next_addr - addr) / PAGE_SIZE;
> + addr = next_addr;
> + }
> +
> + if (!start_pte) {
> + start_pte = pte = pte_offset_map_lock(
> + mm, pmd, addr, &ptl);
> + if (!start_pte)
> + break;
> + arch_enter_lazy_mmu_mode();
> + }
> +
> +next_folio:
> pte--;
> addr -= PAGE_SIZE;
> continue;
> --
> 2.33.1
>

Thanks
Barry