Re: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise()

From: Lance Yang
Date: Fri Jan 26 2024 - 05:31:14 EST


I would like to correct the THP settings given in my previous
email and add one more detail.

On Fri, Jan 26, 2024 at 2:16 PM Lance Yang <ioworker0@xxxxxxxxx> wrote:
>
> I’d like to add another real use case.
>
> In our company, we deploy applications using offline-online
> hybrid deployment. This approach leverages the distinctive
> resource utilization patterns of online services, utilizing idle
> resources during various time periods by filling them with
> offline jobs. This helps reduce the enterprise's growing
> costs.
>
> Whether for online services or offline jobs, their requirements
> for THP can be roughly categorized into three types:
>
> * The first type aims to use huge pages as much as possible
> and tolerates unpredictable stalls caused by direct reclaim
> and/or compaction.
> * The second type attempts to use huge pages but is relatively
> latency-sensitive and cannot tolerate unpredictable stalls.
> * The third type prefers not to use huge pages at all and is
> extremely latency-sensitive.
>
> After careful consideration, we decided to prioritize the
> requirements of the first type and modify the THP settings
> as follows:
>
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo defer >/sys/kernel/mm/transparent_hugepage/defrag
>
> With MADV_COLLAPSE now in the kernel, huge page collapse
> no longer depends on any sysfs setting under
> /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE
> offers the potential for fine-grained synchronous control over
> the huge page allocation mechanism, marking a significant
> enhancement for THP.
>
> If the kernel supports a more relaxed (opportunistic)
> MADV_COLLAPSE, we will modify the THP settings as follows:
>
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo madvise >/sys/kernel/mm/transparent_hugepage/defrag

The correct THP settings should be:
echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag

>
> Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag)
> to address the requirements of the second type.
>
> Why don't we favor madvise(MADV_COLLAPSE) for the first type
> of requirements?
> The main reason is that these requirements are typically for offline
> jobs in the Hadoop ecosystem, such as MapReduce and Spark,
> which run primarily on the JVM. IIRC, the JVM currently does not
> support madvise(MADV_COLLAPSE). The second type of

To add, there are also some offline jobs that rely on PyTorch for
machine learning model training tasks. IIRC, PyTorch also does
not support madvise(MADV_COLLAPSE).

Thanks,
Lance

> requirements is all for our in-house developed online services.
> For us, integrating a more relaxed (opportunistic)
> MADV_COLLAPSE into our online services is relatively
> straightforward.
>
> By introducing various flags to MADV_COLLAPSE, we can offer
> multiple synchronous allocation strategies for applications. This
> fine-grained control may be more suitable for cloud-native
> environments than the widespread settings under
> /sys/kernel/mm/transparent_hugepage in sysfs.
>
> Thanks for your time!
> Lance
>
> On Sun, Jan 21, 2024 at 11:12 AM Lance Yang <ioworker0@xxxxxxxxx> wrote:
> >
> > Hello Everyone,
> >
> > For applications actively utilizing THP, the defrag mode may
> > not be a very user-friendly design. Here are the reasons:
> > 1. Before marking the address space with
> > MADV_HUGEPAGE, an application must check whether
> > the current defrag mode configuration matches
> > its preferences.
> > 2. Once the defrag mode configuration changes, these
> > applications may face the risk of unpredictable stalls.
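
To illustrate the check described in point 1, here is a minimal sketch of how
an application might read the active defrag mode before calling
madvise(MADV_HUGEPAGE). It relies on the sysfs files marking the active value
with square brackets. The helper names thp_selected() and thp_read_setting()
are hypothetical and exist only for this example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the bracketed (active) value from a THP sysfs line such as
 * "always defer defer+madvise [madvise] never".
 * Returns 0 on success, -1 if there is no bracketed value or it does
 * not fit in out. (Hypothetical helper, for illustration only.)
 */
static int thp_selected(const char *line, char *out, size_t outlen)
{
    const char *l = strchr(line, '[');
    const char *r = l ? strchr(l, ']') : NULL;
    size_t n;

    if (!l || !r)
        return -1;
    n = (size_t)(r - l - 1);
    if (n + 1 > outlen)
        return -1;
    memcpy(out, l + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read one sysfs file and copy its active value into out; 0 on success. */
static int thp_read_setting(const char *path, char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen(path, "r");
    int ret = -1;

    if (!f)
        return -1;
    if (fgets(line, sizeof(line), f))
        ret = thp_selected(line, out, outlen);
    fclose(f);
    return ret;
}
```

An application would call thp_read_setting() on
/sys/kernel/mm/transparent_hugepage/defrag and compare the result with the
modes it can tolerate. The result is only a snapshot: as point 2 says, the
setting can change afterwards.
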
> >
> > THP is an important feature of the Linux kernel that can
> > significantly enhance memory access performance.
> > However, due to the lack of fine-grained control over
> > the huge page allocation strategy, many applications
> > default to not using huge pages and even recommend that
> > users disable THP. This situation is regrettable.
> >
> > MADV_COLLAPSE, now available in the kernel, is not affected
> > by the defrag mode.
> > MADV_COLLAPSE offers the potential for
> > fine-grained synchronous control over the huge page
> > allocation mechanism, marking a significant enhancement
> > for THP.
> >
> > By adding flags to MADV_COLLAPSE, different
> > synchronous allocation strategies can be provided to
> > applications. This can instill confidence in them, allowing
> > them to reconsider using THP and allocate huge pages
> > according to their desired synchronous allocation strategy,
> > without worrying about the defrag mode configuration.
> >
> > BR,
> > Lance
> >
> >
> > On Thu, Jan 18, 2024 at 8:03 PM Lance Yang <ioworker0@xxxxxxxxx> wrote:
> > >
> > > This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> > >
> > > Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
> > > has CAP_SYS_ADMIN or is requesting the collapse of its own memory.
> > >
> > > The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
> > > it avoids direct reclaim and/or compaction, quickly failing on allocation
> > > errors.
> > >
> > > This change enables a more flexible and efficient usage of memory collapse
> > > operations, providing additional control to userspace applications for
> > > system-wide THP optimization.
> > >
> > > Semantics
> > >
> > > This call is independent of the system-wide THP sysfs settings, but will
> > > fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
> > > multiple VMAs, the semantics of the collapse over each VMA are independent
> > > of the others. This implies a hugepage cannot cross a VMA boundary. If
> > > collapse of a given hugepage-aligned/sized region fails, the operation may
> > > continue to attempt collapsing the remainder of memory specified.
> > >
> > > The memory ranges provided must be page-aligned, but are not required to
> > > be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
> > > start/end of the range will be clamped to the first/last hugepage-aligned
> > > address covered by said range. The memory ranges must span at least one
> > > hugepage-sized region.
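
The clamping rule in the paragraph above can be sketched in userspace C as
follows. This is an illustration only, not the kernel code: it assumes 2 MiB
PMD-sized huge pages (as on x86-64 with 4 KiB base pages), and the name
clamp_to_hugepages() is hypothetical.

```c
#include <stdint.h>

/* Assumption: 2 MiB huge pages; other architectures and configs differ. */
#define HPAGE_SIZE ((uintptr_t)2 << 20)
#define HPAGE_MASK (~(HPAGE_SIZE - 1))

/*
 * Clamp the page-aligned range [start, end) inward to hugepage boundaries.
 * Returns 1 and fills hstart/hend when at least one hugepage-sized region
 * remains, otherwise returns 0. Overflow near the top of the address
 * space is not handled in this sketch.
 */
static int clamp_to_hugepages(uintptr_t start, uintptr_t end,
                              uintptr_t *hstart, uintptr_t *hend)
{
    uintptr_t s = (start + HPAGE_SIZE - 1) & HPAGE_MASK; /* round start up */
    uintptr_t e = end & HPAGE_MASK;                      /* round end down */

    if (s >= e)
        return 0;
    *hstart = s;
    *hend = e;
    return 1;
}
```

For example, the range [0x201000, 0x601000) clamps to [0x400000, 0x600000),
while [0x201000, 0x3ff000) covers no complete hugepage-sized region and is
rejected.
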
> > >
> > > All non-resident pages covered by the range will first be
> > > swapped/faulted-in, before being internally copied onto a freshly
> > > allocated hugepage. Unmapped pages will have their data directly
> > > initialized to 0 in the new hugepage. However, for every eligible
> > > hugepage aligned/sized region to-be collapsed, at least one page must
> > > currently be backed by memory (a PMD covering the address range must
> > > already exist).
> > >
> > > Allocation for the new hugepage will not enter direct reclaim and/or
> > > compaction, quickly failing if allocation fails. When the system has
> > > multiple NUMA nodes, the hugepage will be allocated from the node providing
> > > the most native pages. The call operates on the current state of the
> > > specified process and makes no persistent changes or guarantees on how pages
> > > will be mapped, constructed, or faulted in the future.
> > >
> > > Use Cases
> > >
> > > An immediate user of this new functionality is the Go runtime heap allocator
> > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > newly allocated chunk through mmap() or a reused chunk released by
> > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > respectively. However, both approaches resulted in performance issues; for
> > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
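
For a caller collapsing its own memory, the proposed call might look roughly
like the sketch below. It assumes a kernel with this patch applied:
MADV_F_COLLAPSE_LIGHT is not in mainline headers, and the value 26 is the one
this patch proposes for asm-generic. The fallback syscall numbers follow
asm-generic numbering, and the helper name try_collapse_light() is
hypothetical. On a kernel without the patch the call is expected to fail,
most likely with EINVAL.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: advice value proposed by this patch, not a mainline constant. */
#ifndef MADV_F_COLLAPSE_LIGHT
#define MADV_F_COLLAPSE_LIGHT 26
#endif
/* Fallbacks for older libc headers (asm-generic syscall numbers). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif

/*
 * Best-effort collapse of [addr, addr + len) in the calling process.
 * Returns 0 on success or a negative errno. A failure is not fatal:
 * the memory simply stays backed by base pages.
 */
static int try_collapse_light(void *addr, size_t len)
{
    struct iovec iov = { .iov_base = addr, .iov_len = len };
    int pidfd = (int)syscall(SYS_pidfd_open, (long)getpid(), 0L);
    long ret;
    int err;

    if (pidfd < 0)
        return -errno;

    /* process_madvise() returns the number of bytes advised on success. */
    ret = syscall(SYS_process_madvise, (long)pidfd, &iov, 1UL,
                  (long)MADV_F_COLLAPSE_LIGHT, 0L);
    err = ret < 0 ? -errno : 0;
    close(pidfd);
    return err;
}
```

An allocator would call this on a hugepage-sized chunk after faulting in at
least one page, and treat any error as a hint to retry later or leave the
chunk as it is.
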
> > >
> > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > > [4] https://github.com/golang/go/issues/63334
> > >
> > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@xxxxxxxxx/
> > >
> > > Signed-off-by: Lance Yang <ioworker0@xxxxxxxxx>
> > > Suggested-by: Zach O'Keefe <zokeefe@xxxxxxxxxx>
> > > Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> > > ---
> > > V1 -> V2: Treat process_madvise(MADV_F_COLLAPSE_LIGHT) as the lighter-weight alternative
> > > to madvise(MADV_COLLAPSE)
> > >
> > > arch/alpha/include/uapi/asm/mman.h | 1 +
> > > arch/mips/include/uapi/asm/mman.h | 1 +
> > > arch/parisc/include/uapi/asm/mman.h | 1 +
> > > arch/xtensa/include/uapi/asm/mman.h | 1 +
> > > include/linux/huge_mm.h | 5 +--
> > > include/uapi/asm-generic/mman-common.h | 1 +
> > > mm/khugepaged.c | 15 ++++++--
> > > mm/madvise.c | 36 +++++++++++++++++---
> > > tools/include/uapi/asm-generic/mman-common.h | 1 +
> > > 9 files changed, 52 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > > index 763929e814e9..22f23ca04f1a 100644
> > > --- a/arch/alpha/include/uapi/asm/mman.h
> > > +++ b/arch/alpha/include/uapi/asm/mman.h
> > > @@ -77,6 +77,7 @@
> > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> > >
> > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > > /* compatibility flags */
> > > #define MAP_FILE 0
> > > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > > index c6e1fc77c996..acec0b643e9c 100644
> > > --- a/arch/mips/include/uapi/asm/mman.h
> > > +++ b/arch/mips/include/uapi/asm/mman.h
> > > @@ -104,6 +104,7 @@
> > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> > >
> > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > > /* compatibility flags */
> > > #define MAP_FILE 0
> > > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > > index 68c44f99bc93..812029c98cd7 100644
> > > --- a/arch/parisc/include/uapi/asm/mman.h
> > > +++ b/arch/parisc/include/uapi/asm/mman.h
> > > @@ -71,6 +71,7 @@
> > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> > >
> > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > > #define MADV_HWPOISON 100 /* poison a page for testing */
> > > #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */
> > > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > > index 1ff0c858544f..52ef463dd5b6 100644
> > > --- a/arch/xtensa/include/uapi/asm/mman.h
> > > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > > @@ -112,6 +112,7 @@
> > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> > >
> > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > > /* compatibility flags */
> > > #define MAP_FILE 0
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 5adb86af35fc..075fdb5d481a 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> > > int advice);
> > > int madvise_collapse(struct vm_area_struct *vma,
> > > struct vm_area_struct **prev,
> > > - unsigned long start, unsigned long end);
> > > + unsigned long start, unsigned long end, int behavior);
> > > void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> > > unsigned long end, long adjust_next);
> > > spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > > @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> > >
> > > static inline int madvise_collapse(struct vm_area_struct *vma,
> > > struct vm_area_struct **prev,
> > > - unsigned long start, unsigned long end)
> > > + unsigned long start, unsigned long end,
> > > + int behavior)
> > > {
> > > return -EINVAL;
> > > }
> > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > index 6ce1f1ceb432..92c67bc755da 100644
> > > --- a/include/uapi/asm-generic/mman-common.h
> > > +++ b/include/uapi/asm-generic/mman-common.h
> > > @@ -78,6 +78,7 @@
> > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> > >
> > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > > /* compatibility flags */
> > > #define MAP_FILE 0
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 2b219acb528e..2840051c0ae2 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -97,6 +97,8 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
> > > struct collapse_control {
> > > bool is_khugepaged;
> > >
> > > + int behavior;
> > > +
> > > /* Num pages scanned per node */
> > > u32 node_load[MAX_NUMNODES];
> > >
> > > @@ -1058,10 +1060,16 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> > > static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> > > struct collapse_control *cc)
> > > {
> > > - gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> > > - GFP_TRANSHUGE);
> > > int node = hpage_collapse_find_target_node(cc);
> > > struct folio *folio;
> > > + gfp_t gfp;
> > > +
> > > + if (cc->is_khugepaged)
> > > + gfp = alloc_hugepage_khugepaged_gfpmask();
> > > + else
> > > + gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ?
> > > + GFP_TRANSHUGE_LIGHT :
> > > + GFP_TRANSHUGE);
> > >
> > > if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
> > > *hpage = NULL;
> > > @@ -2697,7 +2705,7 @@ static int madvise_collapse_errno(enum scan_result r)
> > > }
> > >
> > > int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > - unsigned long start, unsigned long end)
> > > + unsigned long start, unsigned long end, int behavior)
> > > {
> > > struct collapse_control *cc;
> > > struct mm_struct *mm = vma->vm_mm;
> > > @@ -2718,6 +2726,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > if (!cc)
> > > return -ENOMEM;
> > > cc->is_khugepaged = false;
> > > + cc->behavior = behavior;
> > >
> > > mmgrab(mm);
> > > lru_add_drain_all();
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 912155a94ed5..9c40226505aa 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
> > > case MADV_POPULATE_READ:
> > > case MADV_POPULATE_WRITE:
> > > case MADV_COLLAPSE:
> > > + case MADV_F_COLLAPSE_LIGHT:
> > > return 0;
> > > default:
> > > /* be safe, default to 1. list exceptions explicitly */
> > > @@ -1082,8 +1083,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > > if (error)
> > > goto out;
> > > break;
> > > + case MADV_F_COLLAPSE_LIGHT:
> > > case MADV_COLLAPSE:
> > > - return madvise_collapse(vma, prev, start, end);
> > > + return madvise_collapse(vma, prev, start, end, behavior);
> > > }
> > >
> > > anon_name = anon_vma_name(vma);
> > > @@ -1178,6 +1180,7 @@ madvise_behavior_valid(int behavior)
> > > case MADV_HUGEPAGE:
> > > case MADV_NOHUGEPAGE:
> > > case MADV_COLLAPSE:
> > > + case MADV_F_COLLAPSE_LIGHT:
> > > #endif
> > > case MADV_DONTDUMP:
> > > case MADV_DODUMP:
> > > @@ -1194,6 +1197,17 @@ madvise_behavior_valid(int behavior)
> > > }
> > > }
> > >
> > > +
> > > +static bool process_madvise_behavior_only(int behavior)
> > > +{
> > > + switch (behavior) {
> > > + case MADV_F_COLLAPSE_LIGHT:
> > > + return true;
> > > + default:
> > > + return false;
> > > + }
> > > +}
> > > +
> > > static bool process_madvise_behavior_valid(int behavior)
> > > {
> > > switch (behavior) {
> > > @@ -1201,6 +1215,7 @@ static bool process_madvise_behavior_valid(int behavior)
> > > case MADV_PAGEOUT:
> > > case MADV_WILLNEED:
> > > case MADV_COLLAPSE:
> > > + case MADV_F_COLLAPSE_LIGHT:
> > > return true;
> > > default:
> > > return false;
> > > @@ -1368,6 +1383,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > > * transparent huge pages so the existing pages will not be
> > > * coalesced into THP and new pages will not be allocated as THP.
> > > * MADV_COLLAPSE - synchronously coalesce pages into new THP.
> > > + * MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or
> > > + * compaction.
> > > * MADV_DONTDUMP - the application wants to prevent pages in the given range
> > > * from being included in its core dump.
> > > * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > > @@ -1394,7 +1411,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > > * -EBADF - map exists, but area maps something that isn't a file.
> > > * -EAGAIN - a kernel resource was temporarily unavailable.
> > > */
> > > -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> > > +int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> > > + int behavior, bool is_process_madvise)
> > > {
> > > unsigned long end;
> > > int error;
> > > @@ -1405,6 +1423,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> > > if (!madvise_behavior_valid(behavior))
> > > return -EINVAL;
> > >
> > > + if (!is_process_madvise && process_madvise_behavior_only(behavior))
> > > + return -EINVAL;
> > > +
> > > if (!PAGE_ALIGNED(start))
> > > return -EINVAL;
> > > len = PAGE_ALIGN(len_in);
> > > @@ -1448,9 +1469,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> > > return error;
> > > }
> > >
> > > +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> > > +{
> > > + return _do_madvise(mm, start, len_in, behavior, false);
> > > +}
> > > +
> > > SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > > {
> > > - return do_madvise(current->mm, start, len_in, behavior);
> > > + return _do_madvise(current->mm, start, len_in, behavior, false);
> > > }
> > >
> > > SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > > @@ -1504,8 +1530,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > > total_len = iov_iter_count(&iter);
> > >
> > > while (iov_iter_count(&iter)) {
> > > - ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> > > - iter_iov_len(&iter), behavior);
> > > + ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> > > + iter_iov_len(&iter), behavior, true);
> > > if (ret < 0)
> > > break;
> > > iov_iter_advance(&iter, iter_iov_len(&iter));
> > > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> > > index 6ce1f1ceb432..92c67bc755da 100644
> > > --- a/tools/include/uapi/asm-generic/mman-common.h
> > > +++ b/tools/include/uapi/asm-generic/mman-common.h
> > > @@ -78,6 +78,7 @@
> > > #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
> > >
> > > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> > > +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
> > >
> > > /* compatibility flags */
> > > #define MAP_FILE 0
> > > --
> > > 2.33.1
> > >