Re: [PATCH v3 00/21] Free some vmemmap pages of hugetlb page

From: Mike Kravetz
Date: Tue Nov 10 2020 - 14:24:36 EST



Thanks for continuing to work on this, Muchun!

On 11/8/20 6:10 AM, Muchun Song wrote:
...
> For tail pages, the value of compound_head is the same. So we can reuse
> first page of tail page structs. We map the virtual addresses of the
> remaining 6 pages of tail page structs to the first tail page struct,
> and then free these 6 pages. Therefore, we need to reserve at least 2
> pages as vmemmap areas.
>
> When a hugetlbpage is freed to the buddy system, we should allocate six
> pages for vmemmap pages and restore the previous mapping relationship.
>
> If we use the 1G hugetlbpage, we can save 4095 pages. This is a very
> substantial gain.

Is that 4095 number accurate? Are we not using two pages of struct pages
as in the 2MB case?

Also, because we are splitting the huge page mappings in the vmemmap,
additional PTE pages will need to be allocated before we can free the
pages of struct pages. The net savings may therefore be less than what
is stated above.

Perhaps this should mention that allocation of additional page table pages
may be required?
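
To put rough numbers on the two questions above, here is a
back-of-the-envelope sketch in Python. It assumes x86-64 with 4KB base
pages, a 64-byte struct page, and a vmemmap that is initially mapped with
2MB PMD entries. None of those figures are stated in the cover letter, and
the function names are purely illustrative.

```python
# Assumptions (not taken from the patch series): x86-64, 4KB base pages,
# sizeof(struct page) == 64, vmemmap initially mapped with 2MB PMD entries.
PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 64
PMD_SIZE = 2 * 1024 * 1024

MB = 1024 * 1024
GB = 1024 * MB


def vmemmap_pages(huge_page_size):
    """Number of 4KB pages holding the struct pages of one huge page."""
    base_pages = huge_page_size // PAGE_SIZE
    return base_pages * STRUCT_PAGE_SIZE // PAGE_SIZE


def freeable(huge_page_size, reserved):
    """vmemmap pages that could be freed if `reserved` of them are kept."""
    return vmemmap_pages(huge_page_size) - reserved


def pte_pages_for_split(huge_page_size):
    """PTE pages needed to remap this huge page's vmemmap at 4KB granularity.

    One PTE page maps PMD_SIZE bytes of vmemmap. The result is a float
    because several small huge pages can share a single PTE page.
    """
    return vmemmap_pages(huge_page_size) * PAGE_SIZE / PMD_SIZE


if __name__ == "__main__":
    for name, size in (("2MB", 2 * MB), ("1GB", GB)):
        print(name,
              "vmemmap pages:", vmemmap_pages(size),
              "freeable (1 kept):", freeable(size, 1),
              "freeable (2 kept):", freeable(size, 2),
              "PTE pages if split:", pte_pages_for_split(size))
```

Under these assumptions a 2MB page has 8 vmemmap pages, which matches the
"free 6, keep 2" description. A 1GB page has 4096, so 4095 corresponds to
keeping only one page, while keeping two as in the 2MB case would give
4094. If all 8 PMDs covering a 1GB page's vmemmap had to be split, that
would cost up to 8 PTE pages, for a net of roughly 4086.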

...
> Because the vmemmap page tables are reconstructed on the freeing/allocating
> path, there is some added overhead. Here is some overhead analysis.
>
> 1) Allocating 10240 2MB hugetlb pages.
>
> a) With this patch series applied:
> # time echo 10240 > /proc/sys/vm/nr_hugepages
>
> real 0m0.166s
> user 0m0.000s
> sys 0m0.166s
>
> # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [8K, 16K) 8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K) 1868 |@@@@@@@@@@@ |
> [32K, 64K) 10 | |
> [64K, 128K) 2 | |
>
> b) Without this patch series:
> # time echo 10240 > /proc/sys/vm/nr_hugepages
>
> real 0m0.066s
> user 0m0.000s
> sys 0m0.066s
>
> # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [4K, 8K) 10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K) 62 | |
> [16K, 32K) 2 | |
>
> Summary: allocation with this feature is about 2x slower than before.
>
> 2) Freeing 10240 2MB hugetlb pages.
>
> a) With this patch series applied:
> # time echo 0 > /proc/sys/vm/nr_hugepages
>
> real 0m0.004s
> user 0m0.000s
> sys 0m0.002s
>
> # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [16K, 32K) 10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> b) Without this patch series:
> # time echo 0 > /proc/sys/vm/nr_hugepages
>
> real 0m0.077s
> user 0m0.001s
> sys 0m0.075s
>
> # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [4K, 8K) 9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K) 287 |@ |
> [16K, 32K) 3 | |
>
> Summary: __free_hugepage is about 2-4x slower than before.
> But based on the allocation test above, I think freeing is
> also about 2x slower than before.
>
> But why is the 'real' time of the patched kernel smaller than before?
> Because in this patch series, freeing hugetlb pages is asynchronous
> (through a kworker).
>
> Although the overhead has increased, it is not incurred on the
> allocating/freeing of each hugetlb page. It is paid only once, when we
> reserve some hugetlb pages through /proc/sys/vm/nr_hugepages. Once the
> reservation is successful, the subsequent allocating, freeing and using
> are the same as before (not patched). So I think that the overhead is
> acceptable.

Thank you for benchmarking. There are still some instances where huge pages
are allocated 'on the fly' instead of being pulled from the pool. Michal
pointed out the case of page migration. It is also possible for someone to
use hugetlbfs without pre-allocating huge pages to the pool. I remember the
use case pointed out in commit 099730d67417. It says, "I have a hugetlbfs
user which is never explicitly allocating huge pages with 'nr_hugepages'.
They only set 'nr_overcommit_hugepages' and then let the pages be allocated
from the buddy allocator at fault time." In this case, I suspect they were
using 'page fault' allocation for initialization much like someone using
/proc/sys/vm/nr_hugepages. So, the overhead may not be as noticeable.
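
For a sense of scale in the 'on the fly' cases, the quoted wall-clock
numbers can be turned into a rough per-page figure. The sketch below only
divides the 'real' times from the allocation test by the page count. It
assumes the difference is spread evenly across pages, which is a
simplification, and the result may not carry over directly to fault-time
or migration allocations.

```python
# Figures copied from the quoted allocation benchmark (10240 2MB pages).
NR_PAGES = 10240
REAL_PATCHED_S = 0.166
REAL_UNPATCHED_S = 0.066


def per_page_us(total_seconds, nr_pages=NR_PAGES):
    """Average wall-clock time per huge page, in microseconds."""
    return total_seconds / nr_pages * 1e6


# Extra time per page attributable to the series, assuming an even spread.
extra_us = per_page_us(REAL_PATCHED_S) - per_page_us(REAL_UNPATCHED_S)

# Overall slowdown of the whole pool-population run.
slowdown = REAL_PATCHED_S / REAL_UNPATCHED_S

if __name__ == "__main__":
    print(f"extra per page: {extra_us:.2f} us, slowdown: {slowdown:.2f}x")
```

That works out to a little under 10 microseconds of extra time per 2MB
page, or about 2.5x for the whole run.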

--
Mike Kravetz