Re: [PATCH] mm, hugetlb: allow hugepage allocations to excessively reclaim

From: Mike Kravetz
Date: Mon Oct 07 2019 - 15:03:49 EST


On 10/7/19 12:55 AM, Michal Hocko wrote:
> From: David Rientjes <rientjes@xxxxxxxxxx>
>
> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed") has changed the allocator to bail out early in
> order to prevent potentially excessive memory reclaim.
> __GFP_RETRY_MAYFAIL is designed to retry the allocation, reclaim and
> compaction loop as long as there is a reasonable chance of making
> forward progress. Neither COMPACT_SKIPPED nor COMPACT_DEFERRED
> at the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.
>
> The most obvious affected subsystem is hugetlbfs which allocates huge
> pages based on an admin request (or via admin configured overcommit).
> I have done a simple test which tries to allocate half of the memory
> for hugetlb pages while the memory is full of a clean page cache. This
> is not an unusual situation because we try to cache as much of the
> memory as possible, and the sysctl/sysfs interface for allocating huge
> pages exists precisely so that hugetlb pages can be allocated at any time.
>
> The system has 1GB of RAM and we are requesting 512MB worth of hugetlb
> pages after the memory is prefilled by a clean page cache:
> root@test1:~# cat hugetlb_test.sh
>
> set -x
> echo 0 > /proc/sys/vm/nr_hugepages
> echo 3 > /proc/sys/vm/drop_caches
> echo 1 > /proc/sys/vm/compact_memory
> dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
> TS=$(date +%s)
> echo 256 > /proc/sys/vm/nr_hugepages
> cat /proc/sys/vm/nr_hugepages
>
> The results for 2 consecutive runs on clean 5.3
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
> + date +%s
> + TS=1569905284
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 256
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
> + date +%s
> + TS=1569905311
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 256
>
> Now with b39d0ee2632d applied
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
> + date +%s
> + TS=1569905516
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 11
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
> + date +%s
> + TS=1569905541
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 12
>
> The success rate went down by a factor of 20!
>
> Hugetlb allocation requests might fail, and it is reasonable to expect
> them to fail under extremely fragmented memory or when the memory is
> under heavy pressure, but the above situation is not that case.
>
> Fix the regression by reverting to the previous behavior for
> __GFP_RETRY_MAYFAIL requests and disabling the bail out heuristic for
> those requests.
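
For reference, the "factor of 20" above can be checked from the nr_hugepages
readings in the transcript (256 on clean 5.3, 11 and 12 with b39d0ee2632d).
This is only a sketch of the arithmetic; the helper name drop_factor is made
up for illustration and is not part of the test script.

```shell
#!/bin/sh
# Hypothetical helper: how many times fewer huge pages were obtained.
# $1 = pages obtained before the regression, $2 = pages obtained after.
# Assumes $2 is non-zero; awk reports an error on division by zero.
drop_factor() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

drop_factor 256 11   # first run with b39d0ee2632d, prints 23.3
drop_factor 256 12   # second run with b39d0ee2632d, prints 21.3
```

Both runs land slightly above 20, consistent with the figure quoted in the
changelog.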

Thank you Michal for doing this.

hugetlbfs allocations are commonly done via sysctl/sysfs shortly after boot
where this may not be as much of an issue. However, I am aware of at least
three use cases where allocations are made after the system has been up and
running for quite some time:
- DB reconfiguration. If sysctl/sysfs fails to get the required number of
  huge pages, the system is rebooted to perform the allocation after boot.
- VM provisioning. If unable to get the required number of huge pages, fall
  back to base pages.
- An application that does not preallocate a pool, but rather allocates pages
  at fault time for optimal NUMA locality.
In all cases, I would expect b39d0ee2632d to cause regressions and noticeable
behavior changes.
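
To make the first two cases concrete, here is a rough sketch of the kind of
check such tooling might perform: write the desired pool size, read back what
the kernel actually delivered, and decide whether to fall back. The function
names and the NR_HUGEPAGES override are assumptions for illustration only,
not taken from any real provisioning tool.

```shell
#!/bin/sh
# Path of the pool-size knob. Overridable (an assumption of this sketch) so
# the logic can be exercised against a plain file without root.
NR_HUGEPAGES=${NR_HUGEPAGES:-/proc/sys/vm/nr_hugepages}

# Ask for $1 huge pages, then print the pool size that is reported back.
request_hugepages() {
    echo "$1" > "$NR_HUGEPAGES" || return 1
    cat "$NR_HUGEPAGES"
}

# Print "ok" if the pool is fully populated, otherwise "short <missing>".
# $1 = pages requested, $2 = pages obtained.
check_pool() {
    requested=$1
    obtained=$2
    if [ "$obtained" -ge "$requested" ]; then
        echo "ok"
    else
        echo "short $((requested - obtained))"
    fi
}
```

A caller would do something like
got=$(request_hugepages 256) && check_pool 256 "$got"
and reboot or fall back to base pages on a "short" result. With the numbers
from the changelog, check_pool 256 11 reports "short 245", which is the
situation these users would start hitting.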

My quick/limited testing in [1] was insufficient. I also mentioned there
that, if something like b39d0ee2632d went forward, I would like exemptions
for __GFP_RETRY_MAYFAIL requests, as in this patch.

>
> [mhocko@xxxxxxxx: reworded changelog]
> Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>

FWIW,
Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>

[1] https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@xxxxxxxxxx
--
Mike Kravetz