Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.

From: Hugh Dickins
Date: Mon Aug 09 2021 - 00:05:09 EST

In reply to: Mike Rapoport: "Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter."
Next in thread: Hugh Dickins: "Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter."

On Fri, 6 Aug 2021, Zi Yan wrote:
> On 6 Aug 2021, at 16:27, Hugh Dickins wrote:
> > On Fri, 6 Aug 2021, Zi Yan wrote:
> >>
> >> In addition, I would like to share more detail on my plan for supporting 1GB PUD THP.
> >> This patchset is the first step, enabling the kernel to allocate 1GB pages, so that
> >> users can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
> >> alloc_contig_pages() or the CMA allocator. The next step is to improve kernel memory
> >> fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
> >> is still limited by memory section size. As a result, I will explore solutions
> >> like having additional larger pageblocks (up to MAX_ORDER) to counter memory
> >> fragmentation. I will discover what else needs to be solved as I gradually improve
> >> 1GB PUD THP support.
> >
> > Sorry to be blunt, but let me state my opinion: 2MB THPs have given and
> > continue to give us more than enough trouble. Complicating the kernel's
> > mm further, just to allow 1GB THPs, seems a very bad tradeoff to me. I
> > understand that it's an appealing personal project; but for the sake
> > of all the rest of us, please leave 1GB huge pages to hugetlbfs (until
> > the day when we are all using 2MB base pages).
>
> I do not agree with you. 2MB THP provides good performance, while letting us
> keep using 4KB base pages. The 2MB THP implementation is the price we pay
> to get the performance. This patchset removes the tie between MAX_ORDER
> and section size to allow >2MB page allocation, which is useful in many
> places. 1GB THP is one of the users. Gigantic pages also improve
> device performance, like GPUs (e.g., AMD GPUs can use any power of two up to
> 1GB pages[1], which I just learnt). Also could you point out which part
> of my patchset complicates kernel’s mm? I could try to simplify it if
> possible.
>
> In addition, I am not sure hugetlbfs is the way to go. THP is managed by
> core mm, whereas hugetlbfs has its own code for memory management.
> As hugetlbfs gets popular, more core mm functionalities have been
> replicated and added to hugetlbfs codebase. It is not a good tradeoff
> either. One of the reasons I work on 1GB THP is that Roman from Facebook
> explicitly mentioned they want to use THP in place of hugetlbfs[2].
>
> I think it might be more constructive to point out the existing issues
> in THP so that we can improve the code together. BTW, I am also working
> on simplifying THP code like generalizing THP split[3] and planning to
> simplify page table manipulation code by reviving Kirill’s idea[4].

You may have good reasons for working on huge PUD entry support;
and perhaps we have different understandings of "THP".

Fragmentation: that's what horrifies me about 1GB THP.

The dark side of THP is compaction. People have put in a lot of effort
to get compaction working as well as it currently does, but getting 512
adjacent 4k pages is not easy. Getting 512*512 adjacent 4k pages is
very much harder. Please put in the work on compaction before you
attempt to support 1GB THP.
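
[Editorial sketch, not part of the original message.] To put numbers on
the two figures above, here is a minimal C sketch of the arithmetic. It
assumes 4KB base pages and 512 entries per page-table level, as on
x86-64. BASE_PAGE_SHIFT, ENTRIES_PER_TABLE, order_for_pages(),
pages_to_mb() and print_thp_scale() are hypothetical names chosen for
illustration; none of them are kernel identifiers.

```c
#include <stdio.h>

/* Assumption: 4KB base pages and 512 entries per page-table level. */
#define BASE_PAGE_SHIFT   12UL
#define ENTRIES_PER_TABLE 512UL

/* Hypothetical helper: smallest order with (1 << order) >= pages. */
unsigned int order_for_pages(unsigned long pages)
{
    unsigned int order = 0;

    while ((1UL << order) < pages)
        order++;
    return order;
}

/* Hypothetical helper: size in MB of a run of contiguous base pages. */
unsigned long pages_to_mb(unsigned long pages)
{
    return (pages << BASE_PAGE_SHIFT) >> 20;
}

/* Print the contiguous-page requirement for PMD- and PUD-sized THPs. */
void print_thp_scale(void)
{
    unsigned long pmd_pages = ENTRIES_PER_TABLE;
    unsigned long pud_pages = ENTRIES_PER_TABLE * ENTRIES_PER_TABLE;

    printf("PMD THP: %lu base pages, order %u, %lu MB\n",
           pmd_pages, order_for_pages(pmd_pages), pages_to_mb(pmd_pages));
    printf("PUD THP: %lu base pages, order %u, %lu MB\n",
           pud_pages, order_for_pages(pud_pages), pages_to_mb(pud_pages));
}
```

Under these assumptions a 2MB THP needs an order-9 run of 512 free base
pages, while a 1GB THP needs an order-18 run of 262144, which is 512
times as many pages that compaction must make contiguous.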

Related fears: unexpected latencies; unacceptable variance between runs;
frequent rebooting of machines to get back to an unfragmented state;
page table code that most of us will never be in a position to test.

Sorry, no, I'm not reading your patches: that's not personal, it's
just that I've more than enough to do already, and must make choices.

Hugh

>
> [1] https://lore.kernel.org/linux-mm/bdec12bd-9188-9f3e-c442-aa33e25303a6@xxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20200903162527.GF60440@xxxxxxxxxxxxxxxxxxxxxxxxxxx/
> [3] https://lwn.net/Articles/837928/
> [4] https://lore.kernel.org/linux-mm/20180424154355.mfjgkf47kdp2by4e@xxxxxxxxxxxxxxxxxx/
>
> —
> Best Regards,
> Yan, Zi
