Re: [External] Re: [PATCH v3 00/21] Free some vmemmap pages of hugetlb page

From: Muchun Song
Date: Tue Nov 10 2020 - 22:21:48 EST


On Wed, Nov 11, 2020 at 3:23 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>
>
> Thanks for continuing to work this Muchun!
>
> On 11/8/20 6:10 AM, Muchun Song wrote:
> ...
> > For tail pages, the value of compound_head is the same. So we can reuse
> > the first page of tail page structs. We map the virtual addresses of the
> > remaining 6 pages of tail page structs to the first tail page struct,
> > and then free those 6 pages. Therefore, we need to reserve at least 2
> > pages as vmemmap areas.
> >
> > When a hugetlbpage is freed to the buddy system, we should allocate six
> > pages for vmemmap pages and restore the previous mapping relationship.
> >
> > If we use the 1G hugetlbpage, we can save 4095 pages. This is a very
> > substantial gain.
>
> Is that 4095 number accurate? Are we not using two pages of struct pages
> as in the 2MB case?

Oh, yeah, it should be 4094; and after subtracting the page table
pages, for a 1GB HugeTLB page it is 4086 pages. Thanks for pointing
out this problem.

>
> Also, because we are splitting the huge page mappings in the vmemmap
> additional PTE pages will need to be allocated. Therefore, some additional
> page table pages may need to be allocated so that we can free the pages
> of struct pages. The net savings may be less than what is stated above.
>
> Perhaps this should mention that allocation of additional page table pages
> may be required?

Yeah, you are right. In a later version of the patch, I will rework
the analysis here to make it clearer and more accurate.

>
> ...
> > Because the vmemmap page tables are reconstructed on the freeing/allocating
> > path, this adds some overhead. Here is an overhead analysis.
> >
> > 1) Allocating 10240 2MB hugetlb pages.
> >
> > a) With this patch series applied:
> > # time echo 10240 > /proc/sys/vm/nr_hugepages
> >
> > real 0m0.166s
> > user 0m0.000s
> > sys 0m0.166s
> >
> > # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> > Attaching 2 probes...
> >
> > @latency:
> > [8K, 16K) 8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [16K, 32K) 1868 |@@@@@@@@@@@ |
> > [32K, 64K) 10 | |
> > [64K, 128K) 2 | |
> >
> > b) Without this patch series:
> > # time echo 10240 > /proc/sys/vm/nr_hugepages
> >
> > real 0m0.066s
> > user 0m0.000s
> > sys 0m0.066s
> >
> > # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> > Attaching 2 probes...
> >
> > @latency:
> > [4K, 8K) 10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [8K, 16K) 62 | |
> > [16K, 32K) 2 | |
> >
> > In summary, this feature makes allocation about ~2x slower than before.
> >
> > 2) Freeing 10240 2MB hugetlb pages.
> >
> > a) With this patch series applied:
> > # time echo 0 > /proc/sys/vm/nr_hugepages
> >
> > real 0m0.004s
> > user 0m0.000s
> > sys 0m0.002s
> >
> > # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> > Attaching 2 probes...
> >
> > @latency:
> > [16K, 32K) 10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > b) Without this patch series:
> > # time echo 0 > /proc/sys/vm/nr_hugepages
> >
> > real 0m0.077s
> > user 0m0.001s
> > sys 0m0.075s
> >
> > # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> > Attaching 2 probes...
> >
> > @latency:
> > [4K, 8K) 9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [8K, 16K) 287 |@ |
> > [16K, 32K) 3 | |
> >
> > In summary, __free_hugepage is about ~2-4x slower than before, but
> > judging from the allocation test above, I think the typical case
> > here is also ~2x slower than before.
> >
> > Why is the 'real' time with the patches smaller than before? Because
> > in this patch series, freeing a hugetlb page is asynchronous (done
> > via a kworker).
> >
> > Although the overhead has increased, it is not paid on every
> > allocation/free of a hugetlb page; it is paid only once, when we reserve
> > hugetlb pages through /proc/sys/vm/nr_hugepages. Once the reservation
> > succeeds, subsequent allocating, freeing and using are the same as
> > before (not patched). So I think the overhead is acceptable.
>
> Thank you for benchmarking. There are still some instances where huge pages
> are allocated 'on the fly' instead of being pulled from the pool. Michal
> pointed out the case of page migration. It is also possible for someone to
> use hugetlbfs without pre-allocating huge pages to the pool. I remember the
> use case pointed out in commit 099730d67417. It says, "I have a hugetlbfs
> user which is never explicitly allocating huge pages with 'nr_hugepages'.
> They only set 'nr_overcommit_hugepages' and then let the pages be allocated
> from the buddy allocator at fault time." In this case, I suspect they were
> using 'page fault' allocation for initialization much like someone using
> /proc/sys/vm/nr_hugepages. So, the overhead may not be as noticeable.

Thanks for pointing out this use case.

>
> --
> Mike Kravetz



--
Yours,
Muchun