Re: [PATCH v2] mm: hugetlb: optionally allocate gigantic hugepages using cma

From: Roman Gushchin
Date: Tue Mar 10 2020 - 14:06:18 EST


On Tue, Mar 10, 2020 at 10:27:01AM -0700, Mike Kravetz wrote:
> On 3/9/20 5:25 PM, Roman Gushchin wrote:
> > Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
> > at runtime") has added the run-time allocation of gigantic pages. However
> > it actually works only at early stages of the system loading, when
> > the majority of memory is free. After some time the memory gets
> > fragmented by non-movable pages, so the chances to find a contiguous
> > 1 GB block are getting close to zero. Even dropping caches manually
> > doesn't help a lot.
> >
> > At large scale, rebooting servers in order to allocate gigantic hugepages
> > is quite expensive and complex. At the same time keeping some constant
> > percentage of memory in reserved hugepages even if the workload isn't
> > using it is a big waste: not all workloads can benefit from using 1 GB
> > pages.
> >
> > The following solution can solve the problem:
> > 1) At boot time a dedicated cma area* is reserved. The size is passed
> > as a kernel argument.
> > 2) Run-time allocations of gigantic hugepages are performed using the
> > cma allocator and the dedicated cma area.
> >
> > In this case gigantic hugepages can be allocated successfully with a
> > high probability, however the memory isn't completely wasted if nobody
> > is using 1GB hugepages: it can be used for pagecache, anon memory,
> > THPs, etc.
> >
> > * On a multi-node machine a per-node cma area is allocated on each node.
> > Subsequent gigantic hugetlb allocations use the first available
> > numa node if the mask isn't specified by the user.
> >
> > Usage:
> > 1) configure the kernel to allocate a cma area for hugetlb allocations:
> > pass hugetlb_cma=10G as a kernel argument
> >
> > 2) allocate hugetlb pages as usual, e.g.
> > echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >
> > If the option isn't enabled or the allocation of the cma area failed,
> > the current behavior of the system is preserved.
> >
> > Only x86 is covered by this patch, but it's trivial to extend it to
> > cover other architectures as well.
> >
> > v2: fixed !CONFIG_CMA build, suggested by Andrew Morton
> >
> > Signed-off-by: Roman Gushchin <guro@xxxxxx>
>
> Thanks! I really like this idea.

Thank you!
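
For readers following along, the sizing scheme from the commit message and
the hunk below (an absolute size or a percentage of memory, split evenly
across online nodes) can be sketched in userspace C. This is only an
illustration, not the patch code: the sketch_* names and the 4 KiB page
size are assumptions made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB base pages. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Hypothetical helper: total number of bytes to reserve.
 * A non-zero percentage takes precedence over the absolute size,
 * mirroring the order of the checks in the quoted hunk.
 */
static uint64_t sketch_total_size(uint64_t cma_size, unsigned int cma_percent,
				  uint64_t totalpages)
{
	if (cma_percent)
		return totalpages * cma_percent / 100 * SKETCH_PAGE_SIZE;
	return cma_size;
}

/* Hypothetical helper: even split of the total across online nodes. */
static uint64_t sketch_per_node_size(uint64_t total, unsigned int nr_nodes)
{
	assert(nr_nodes > 0);
	return total / nr_nodes;
}
```

Under these assumptions, a 10 GiB reservation on a two-node machine gives
sketch_per_node_size(10ULL << 30, 2), i.e. 5 GiB per node.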

>
> > ---
> > .../admin-guide/kernel-parameters.txt | 7 ++
> > arch/x86/kernel/setup.c | 3 +
> > include/linux/hugetlb.h | 2 +
> > mm/hugetlb.c | 115 ++++++++++++++++++
> > 4 files changed, 127 insertions(+)
> >
> <snip>
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index a74262c71484..ceeb06ddfd41 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -16,6 +16,7 @@
> > #include <linux/pci.h>
> > #include <linux/root_dev.h>
> > #include <linux/sfi.h>
> > +#include <linux/hugetlb.h>
> > #include <linux/tboot.h>
> > #include <linux/usb/xhci-dbgp.h>
> >
> > @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
> > initmem_init();
> > dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
> >
> > + hugetlb_cma_reserve();
> > +
>
> I know this is called from arch specific code here to fit in with the timing
> of CMA setup/reservation calls. However, there really is nothing architecture
> specific about this functionality. It would be great IMO if we could make
> this architecture independent. However, I cannot think of a straightforward
> way to do this.

I agree. Unfortunately I have no better idea than having an arch-dependent hook.

>
> > /*
> > * Reserve memory for crash kernel after SRAT is parsed so that it
> > * won't consume hotpluggable memory.
> <snip>
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> <snip>
> > +void __init hugetlb_cma_reserve(void)
> > +{
> > + unsigned long totalpages = 0;
> > + unsigned long start_pfn, end_pfn;
> > + phys_addr_t size;
> > + int nid, i, res;
> > +
> > + if (!hugetlb_cma_size && !hugetlb_cma_percent)
> > + return;
> > +
> > + if (hugetlb_cma_percent) {
> > + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
> > + NULL)
> > + totalpages += end_pfn - start_pfn;
> > +
> > + size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
> > + 10000UL;
> > + } else {
> > + size = hugetlb_cma_size;
> > + }
> > +
> > + pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
> > + size / nr_online_nodes);
> > +
> > + size /= nr_online_nodes;
> > +
> > + for_each_node_state(nid, N_ONLINE) {
> > + unsigned long min_pfn = 0, max_pfn = 0;
> > +
> > + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > + if (!min_pfn)
> > + min_pfn = start_pfn;
> > + max_pfn = end_pfn;
> > + }
> > +
> > + res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
> > + PFN_PHYS(max_pfn), (1UL << 30),
>
> The alignment is hard coded for x86 gigantic page size. If this supports
> more architectures or becomes arch independent we will need to determine
> what this alignment should be. Perhaps an arch-specific callback to get
> the alignment for gigantic pages. That will require a little thought as
> some architectures support multiple gigantic page sizes.

Good point!
Should we take the biggest possible size as a reference?
Or the smallest (larger than MAX_ORDER)?
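
To make the two options concrete, here is a rough userspace sketch.
Everything in it is hypothetical: the sketch_* helpers, the list of
supported sizes and the buddy allocator limit are assumptions chosen for
illustration, not an existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical: a hugepage size counts as "gigantic" here when it is
 * larger than the biggest block the buddy allocator can hand out.
 */
static int sketch_is_gigantic(uint64_t size, uint64_t max_buddy_size)
{
	return size > max_buddy_size;
}

/* Option 1: align to the biggest gigantic size; returns 0 if none. */
static uint64_t sketch_align_biggest(const uint64_t *sizes, size_t n,
				     uint64_t max_buddy_size)
{
	uint64_t best = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (sketch_is_gigantic(sizes[i], max_buddy_size) &&
		    sizes[i] > best)
			best = sizes[i];
	return best;
}

/* Option 2: align to the smallest gigantic size; returns 0 if none. */
static uint64_t sketch_align_smallest(const uint64_t *sizes, size_t n,
				      uint64_t max_buddy_size)
{
	uint64_t best = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (sketch_is_gigantic(sizes[i], max_buddy_size) &&
		    (!best || sizes[i] < best))
			best = sizes[i];
	return best;
}
```

With an illustrative size list of 64 KiB, 2 MiB, 32 MiB and 1 GiB and an
assumed 4 MiB buddy limit, option 1 picks 1 GiB and option 2 picks 32 MiB.
As long as all sizes are powers of two, an area aligned to the biggest size
is also aligned to every smaller one, which would be an argument for
option 1.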

Thanks!