Re: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for attempted synchronous hugepage collapse

From: Lance Yang
Date: Wed Jan 17 2024 - 20:51:55 EST


Hey David,

Thanks for taking the time to review!

On Thu, Jan 18, 2024 at 2:41 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 17.01.24 18:10, Zach O'Keefe wrote:
> > [+linux-mm & others]
> >
> > On Tue, Jan 16, 2024 at 9:02 PM Lance Yang <ioworker0@xxxxxxxxx> wrote:
> >>
> >> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> >>
> >> Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
> >> make a lightweight, best-effort attempt at a synchronous collapse of
> >> memory at their own expense.
> >>
> >> The only difference from MADV_COLLAPSE is that the new hugepage allocation
> >> avoids direct reclaim and/or compaction, quickly failing on allocation errors.
> >>
> >> The benefits of this approach are:
> >>
> >> * CPU is charged to the process that wants to spend the cycles for the THP
> >> * Avoid unpredictable timing of khugepaged collapse
> >> * Prevent unpredictable stalls caused by direct reclaim and/or compaction
> >>
> >> Semantics
> >>
> >> This call is independent of the system-wide THP sysfs settings, but will
> >> fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
> >> multiple VMAs, the semantics of the collapse over each VMA are independent
> >> of the others. This implies a hugepage cannot cross a VMA boundary. If
> >> collapse of a given hugepage-aligned/sized region fails, the operation may
> >> continue to attempt collapsing the remainder of the specified memory.
> >>
> >> The memory ranges provided must be page-aligned, but are not required to
> >> be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
> >> start/end of the range will be clamped to the first/last hugepage-aligned
> >> address covered by said range. The memory ranges must span at least one
> >> hugepage-sized region.
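> >>
> >> For callers that want to know ahead of time exactly which sub-range the
> >> kernel will operate on, the clamping rule above is easy to mirror in
> >> userspace. A minimal sketch, assuming a 2M PMD-sized hugepage (x86-64):
> >>
> >>   #include <stdint.h>
> >>
> >>   #define HPAGE_PMD_SIZE (2UL << 20)   /* assumed hugepage size */
> >>
> >>   /*
> >>    * Clamp [*start, *end) inward to the hugepage-aligned sub-range the
> >>    * kernel would attempt to collapse.  Returns 0 if the range does not
> >>    * cover at least one full hugepage-sized region.
> >>    */
> >>   static int hugepage_clamp(uintptr_t *start, uintptr_t *end)
> >>   {
> >>           uintptr_t mask = HPAGE_PMD_SIZE - 1;
> >>           uintptr_t s = (*start + mask) & ~mask;
> >>           uintptr_t e = *end & ~mask;
> >>
> >>           if (s >= e)
> >>                   return 0;
> >>           *start = s;
> >>           *end = e;
> >>           return 1;
> >>   }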
> >>
> >> All non-resident pages covered by the range will first be
> >> swapped/faulted-in, before being internally copied onto a freshly
> >> allocated hugepage. Unmapped pages will have their data directly
> >> initialized to 0 in the new hugepage. However, for every eligible
> >> hugepage-aligned/sized region to be collapsed, at least one page must
> >> currently be backed by memory (a PMD covering the address range must
> >> already exist).
> >>
> >> Allocation of the new hugepage will not enter direct reclaim and/or
> >> compaction, failing quickly if no hugepage is immediately available. When
> >> the system has multiple NUMA nodes, the hugepage will be allocated from
> >> the node providing the most native pages. This operation acts on the
> >> current state of the specified process and makes no persistent changes or
> >> guarantees on how pages will be mapped, constructed, or faulted in the
> >> future.
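> >>
> >> As a small illustration of the residency requirement, a caller collapsing
> >> a freshly created anonymous mapping would first have to fault in at least
> >> one page per hugepage-sized region. Rough sketch; error handling trimmed,
> >> and MADV_TRY_COLLAPSE is assumed to come from the uapi header updated by
> >> this patch:
> >>
> >>   #include <stddef.h>
> >>   #include <sys/mman.h>
> >>
> >>   #define HPAGE_PMD_SIZE (2UL << 20)   /* assumed hugepage size */
> >>
> >>   static void *map_and_try_collapse(size_t len)
> >>   {
> >>           char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
> >>                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >>           if (p == MAP_FAILED)
> >>                   return NULL;
> >>
> >>           /*
> >>            * An untouched mapping has no PMD yet, so the collapse would
> >>            * fail.  Touch one page in each hugepage-sized region so every
> >>            * region has at least one resident page.
> >>            */
> >>           for (size_t off = 0; off < len; off += HPAGE_PMD_SIZE)
> >>                   p[off] = 0;
> >>
> >>           /* Best effort: failure just leaves the mapping on 4K pages. */
> >>           madvise(p, len, MADV_TRY_COLLAPSE);
> >>           return p;
> >>   }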
> >>
> >> Return Value
> >>
> >> If all hugepage-sized/aligned regions covered by the provided range were
> >> either successfully collapsed, or were already PMD-mapped THPs, this
> >> operation will be deemed successful. On success, madvise(2) returns 0.
> >> Else, -1 is returned and errno is set to indicate the error for the
> >> most recently attempted hugepage collapse. Note that many failures might
> >> have occurred, since the operation may continue collapsing other regions
> >> after a single hugepage-sized/aligned region fails.
> >>
> >> ENOMEM Memory allocation failed or VMA not found
> >> EBUSY  Memcg charging failed
> >> EAGAIN Required resource temporarily unavailable. Trying again
> >>        might succeed.
> >> EINVAL Other error: No PMD found, subpage doesn't have the Present
> >>        bit set, "Special" page not backed by a struct page, VMA
> >>        incorrectly sized, address not page-aligned, ...
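> >>
> >> A sketch of how a caller might act on these errors, treating EAGAIN as
> >> retryable and everything else as a plain miss (try_collapse() is just a
> >> hypothetical wrapper, not part of this series):
> >>
> >>   #include <errno.h>
> >>   #include <sys/mman.h>
> >>
> >>   /*
> >>    * Returns 1 if the range is now (or already was) PMD-mapped, 0 if the
> >>    * caller should keep using base pages, -1 if retrying later may help.
> >>    */
> >>   static int try_collapse(void *addr, size_t len)
> >>   {
> >>           if (!madvise(addr, len, MADV_TRY_COLLAPSE))
> >>                   return 1;
> >>           if (errno == EAGAIN)
> >>                   return -1;
> >>           /* ENOMEM, EBUSY, EINVAL, ...: give up on this range for now. */
> >>           return 0;
> >>   }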
> >>
> >> Use Cases
> >>
> >> An immediate user of this new functionality is the Go runtime heap allocator
> >> that manages memory in hugepage-sized chunks. In the past, whether it was a
> >> newly allocated chunk through mmap() or a reused chunk released by
> >> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> >> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> >> respectively. However, both approaches caused performance issues: in
> >> either scenario, the kernel could enter direct reclaim and/or compaction,
> >> leading to unpredictable stalls[4]. With madvise(MADV_TRY_COLLAPSE), the
> >> allocator can attempt to back memory with huge pages without the risk of
> >> such stalls.
> >>
> >> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> >> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> >> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> >> [4] https://github.com/golang/go/issues/63334
> >
> > Thanks for the patch, Lance, and thanks for providing the links above,
> > referring to issues Go has seen.
> >
> > I've reached out to the Go team to try and understand their use case,
> > and how we could help. It's not immediately clear whether a
> > lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
> > be.
> >
> > That said, with respect to the implementation, should a need for a
> > lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
> > process_madvise(2) be the "v2" of madvise(2), where we can start
> > leveraging the forward-facing flags argument for these different
> > advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
> > ("mm/madvise: remove racy mm ownership check") so that
> > process_madvise(2) can always operate on self. IIRC, this was ~ the
> > plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
> > sane default, and implement options in flags down the line).
>
> +1, using process_madvise() would likely be the right approach.

Thanks for your suggestion! I completely agree :)
Lance

>
> --
> Cheers,
>
> David / dhildenb
>