Re: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for attempted synchronous hugepage collapse

From: Lance Yang
Date: Wed Jan 17 2024 - 20:47:11 EST


Hey Zach,

Thanks for taking the time to review!

Zach O'Keefe <zokeefe@xxxxxxxxxx> 于2024年1月18日周四 01:11写道:
>
> [+linux-mm & others]
>
> On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@xxxxxxxxx> wrote:
> >
> > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> >
> > Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
> > make a least-effort attempt at a synchronous collapse of memory at
> > their own expense.
> >
> > The only difference from MADV_COLLAPSE is that the new hugepage allocation
> > avoids direct reclaim and/or compaction, quickly failing on allocation errors.
> >
> > The benefits of this approach are:
> >
> > * CPU is charged to the process that wants to spend the cycles for the THP
> > * Avoid unpredictable timing of khugepaged collapse
> > * Prevent unpredictable stalls caused by direct reclaim and/or compaction
> >
> > Semantics
> >
> > This call is independent of the system-wide THP sysfs settings, but will
> > fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
> > multiple VMAs, the semantics of the collapse over each VMA is independent
> > from the others. This implies a hugepage cannot cross a VMA boundary. If
> > collapse of a given hugepage-aligned/sized region fails, the operation may
> > continue to attempt collapsing the remainder of memory specified.
> >
> > The memory ranges provided must be page-aligned, but are not required to
> > be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
> > start/end of the range will be clamped to the first/last hugepage-aligned
> > address covered by said range. The memory ranges must span at least one
> > hugepage-sized region.
> >
> > All non-resident pages covered by the range will first be
> > swapped/faulted-in, before being internally copied onto a freshly
> > allocated hugepage. Unmapped pages will have their data directly
> > initialized to 0 in the new hugepage. However, for every eligible
> > hugepage aligned/sized region to-be collapsed, at least one page must
> > currently be backed by memory (a PMD covering the address range must
> > already exist).
> >
> > Allocation for the new hugepage will not enter direct reclaim and/or
> > compaction, quickly failing if allocation fails. When the system has
> > multiple NUMA nodes, the hugepage will be allocated from the node providing
> > the most native pages. This operation operates on the current state of the
> > specified process and makes no persistent changes or guarantees on how pages
> > will be mapped, constructed, or faulted in the future.
> >
> > Return Value
> >
> > If all hugepage-sized/aligned regions covered by the provided range were
> > either successfully collapsed, or were already PMD-mapped THPs, this
> > operation will be deemed successful. On success, madvise(2) returns 0.
> > Else, -1 is returned and errno is set to indicate the error for the
> > most-recently attempted hugepage collapse. Note that many failures might
> > have occurred, since the operation may continue to collapse in the event a
> > single hugepage-sized/aligned region fails.
> >
> > ENOMEM Memory allocation failed or VMA not found
> > EBUSY Memcg charging failed
> > EAGAIN Required resource temporarily unavailable. Try again
> > might succeed.
> > EINVAL Other error: No PMD found, subpage doesn't have Present
> > bit set, "Special" page no backed by struct page, VMA
> > incorrectly sized, address not page-aligned, ...
> >
> > Use Cases
> >
> > An immediate user of this new functionality is the Go runtime heap allocator
> > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > newly allocated chunk through mmap() or a reused chunk released by
> > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > respectively. However, both approaches resulted in performance issues; for
> > both scenarios, there could be entries into direct reclaim and/or compaction,
> > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages.
> >
> > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > [4] https://github.com/golang/go/issues/63334
>
> Thanks for the patch, Lance, and thanks for providing the links above,
> referring to issues Go has seen.
>
> I've reached out to the Go team to try and understand their use case,
> and how we could help. It's not immediately clear whether a
> lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
> be.
>
> That said, with respect to the implementation, should a need for a
> lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
> process_madvise(2) be the "v2" of madvise(2), where we can start

I agree with you; it's a good idea!

> leveraging the forward-facing flags argument for these different
> advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
> ("mm/madvise: remove racy mm ownership check") so that
> process_madvise(2) can always operate on self. IIRC, this was ~ the
> plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
> sane default, and implement options in flags down the line).
>
> That flag could be a MADV_F_COLLAPSE_LIGHT, where we use a lighter

The name MADV_F_COLLAPSE_LIGHT sounds great for the flag, and its
semantics are very clear.

Thanks again for your review and your suggestion!
Lance

> allocation context, as well as, for example, only do a local
> lru_add_drain() vs lru_add_drain_all(). But I'll refrain from thinking
> too hard about it just yet.
>
> Best,
> Zach
>
>
>
>
> > Signed-off-by: Lance Yang <ioworker0@xxxxxxxxx>
> > ---
> > arch/alpha/include/uapi/asm/mman.h | 1 +
> > arch/mips/include/uapi/asm/mman.h | 1 +
> > arch/parisc/include/uapi/asm/mman.h | 1 +
> > arch/xtensa/include/uapi/asm/mman.h | 1 +
> > include/linux/huge_mm.h | 5 +++--
> > include/uapi/asm-generic/mman-common.h | 1 +
> > mm/khugepaged.c | 19 ++++++++++++++++---
> > mm/madvise.c | 8 +++++++-
> > tools/include/uapi/asm-generic/mman-common.h | 1 +
> > 9 files changed, 32 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 763929e814e9..44aa1f57a982 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -77,6 +77,7 @@
> > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> > /* compatibility flags */
> > #define MAP_FILE 0
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index c6e1fc77c996..1ae16e5d7dfc 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -104,6 +104,7 @@
> > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> > /* compatibility flags */
> > #define MAP_FILE 0
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index 68c44f99bc93..f8d016ee1f98 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -71,6 +71,7 @@
> > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> > #define MADV_HWPOISON 100 /* poison a page for testing */
> > #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index 1ff0c858544f..c495d1b39c83 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -112,6 +112,7 @@
> > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> > /* compatibility flags */
> > #define MAP_FILE 0
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 5adb86af35fc..e1af75aa18fb 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> > int advice);
> > int madvise_collapse(struct vm_area_struct *vma,
> > struct vm_area_struct **prev,
> > - unsigned long start, unsigned long end);
> > + unsigned long start, unsigned long end, bool is_try);
> > void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> > unsigned long end, long adjust_next);
> > spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> >
> > static inline int madvise_collapse(struct vm_area_struct *vma,
> > struct vm_area_struct **prev,
> > - unsigned long start, unsigned long end)
> > + unsigned long start, unsigned long end,
> > + bool is_try)
> > {
> > return -EINVAL;
> > }
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..a9e5273db5f6 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -78,6 +78,7 @@
> > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> > /* compatibility flags */
> > #define MAP_FILE 0
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 2b219acb528e..c22703155b6e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -96,6 +96,7 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > struct collapse_control {
> > bool is_khugepaged;
> > + bool is_try;
> >
> > /* Num pages scanned per node */
> > u32 node_load[MAX_NUMNODES];
> > @@ -1058,10 +1059,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> > static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> > struct collapse_control *cc)
> > {
> > - gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> > - GFP_TRANSHUGE);
> > int node = hpage_collapse_find_target_node(cc);
> > struct folio *folio;
> > + gfp_t gfp;
> > +
> > + if (cc->is_khugepaged)
> > + gfp = alloc_hugepage_khugepaged_gfpmask();
> > + else
> > + gfp = cc->is_try ? GFP_TRANSHUGE_LIGHT : GFP_TRANSHUGE;
> >
> > if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
> > *hpage = NULL;
> > @@ -2697,7 +2702,7 @@ static int madvise_collapse_errno(enum scan_result r)
> > }
> >
> > int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > - unsigned long start, unsigned long end)
> > + unsigned long start, unsigned long end, bool is_try)
> > {
> > struct collapse_control *cc;
> > struct mm_struct *mm = vma->vm_mm;
> > @@ -2718,6 +2723,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > if (!cc)
> > return -ENOMEM;
> > cc->is_khugepaged = false;
> > + cc->is_try = is_try;
> >
> > mmgrab(mm);
> > lru_add_drain_all();
> > @@ -2773,6 +2779,13 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > result = collapse_pte_mapped_thp(mm, addr, true);
> > mmap_read_unlock(mm);
> > goto handle_result;
> > + /* MADV_TRY_COLLAPSE: fail quickly */
> > + case SCAN_ALLOC_HUGE_PAGE_FAIL:
> > + case SCAN_CGROUP_CHARGE_FAIL:
> > + if (cc->is_try) {
> > + last_fail = result;
> > + goto out_maybelock;
> > + }
> > /* Whitelisted set of results where continuing OK */
> > case SCAN_PMD_NULL:
> > case SCAN_PTE_NON_PRESENT:
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..5a359bcd286c 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
> > case MADV_POPULATE_READ:
> > case MADV_POPULATE_WRITE:
> > case MADV_COLLAPSE:
> > + case MADV_TRY_COLLAPSE:
> > return 0;
> > default:
> > /* be safe, default to 1. list exceptions explicitly */
> > @@ -1082,8 +1083,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > if (error)
> > goto out;
> > break;
> > + case MADV_TRY_COLLAPSE:
> > + return madvise_collapse(vma, prev, start, end, true);
> > case MADV_COLLAPSE:
> > - return madvise_collapse(vma, prev, start, end);
> > + return madvise_collapse(vma, prev, start, end, false);
> > }
> >
> > anon_name = anon_vma_name(vma);
> > @@ -1178,6 +1181,7 @@ madvise_behavior_valid(int behavior)
> > case MADV_HUGEPAGE:
> > case MADV_NOHUGEPAGE:
> > case MADV_COLLAPSE:
> > + case MADV_TRY_COLLAPSE:
> > #endif
> > case MADV_DONTDUMP:
> > case MADV_DODUMP:
> > @@ -1368,6 +1372,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > * transparent huge pages so the existing pages will not be
> > * coalesced into THP and new pages will not be allocated as THP.
> > * MADV_COLLAPSE - synchronously coalesce pages into new THP.
> > + * MADV_TRY_COLLAPSE - similar to COLLAPSE, but avoids direct reclaim
> > + * and/or compaction.
> > * MADV_DONTDUMP - the application wants to prevent pages in the given range
> > * from being included in its core dump.
> > * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..a9e5273db5f6 100644
> > --- a/tools/include/uapi/asm-generic/mman-common.h
> > +++ b/tools/include/uapi/asm-generic/mman-common.h
> > @@ -78,6 +78,7 @@
> > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> >
> > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > +#define MADV_TRY_COLLAPSE 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> >
> > /* compatibility flags */
> > #define MAP_FILE 0
> > --
> > 2.33.1
> >