[Question] Should direct reclaim time be bounded?

From: Mike Kravetz
Date: Tue Apr 23 2019 - 00:08:00 EST


I was looking into an issue on our distro kernel where allocation of huge
pages via "echo X > /proc/sys/vm/nr_hugepages" was taking a LONG time.
In this particular case, we were actually allocating huge pages VERY slowly
at the rate of about one every 30 seconds. I don't want to talk about the
code in our distro kernel, but the situation that caused this issue exists
upstream and appears to be worse there.

One thing to note is that hugetlb page allocation can really stress the
page allocator. The routine alloc_pool_huge_page is of special concern.

/*
 * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
 * manner.
 */
static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
	struct page *page;
	int nr_nodes, node;
	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
		if (page)
			break;
	}

	if (!page)
		return 0;

	put_page(page); /* free it into the hugepage allocator */

	return 1;
}

This routine is called for each huge page the user wants to allocate. If
they do "echo 4096 > nr_hugepages", this is called 4096 times.
alloc_fresh_huge_page() will eventually call __alloc_pages_nodemask with
__GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN in addition to __GFP_THISNODE.
That for_each_node_mask_to_alloc() macro is hugetlbfs specific and attempts
to allocate huge pages in a round robin fashion: when asked to allocate a
huge page, it first tries 'next_nid_to_alloc', and if that fails it moves on
to the next allowed node (a simplified sketch of the loop follows the quoted
documentation below). This is 'documented' in kernel docs as:

"On a NUMA platform, the kernel will attempt to distribute the huge page pool
over all the set of allowed nodes specified by the NUMA memory policy of the
task that modifies nr_hugepages. The default for the allowed nodes--when the
task has default memory policy--is all on-line nodes with memory. Allowed
nodes with insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the discussion
below of the interaction of task memory policy, cpusets and per node attributes
with the allocation and freeing of persistent huge pages.

The success or failure of huge page allocation depends on the amount of
physically contiguous memory that is present in system at the time of the
allocation attempt. If the kernel is unable to allocate huge pages from
some nodes in a NUMA system, it will attempt to make up the difference by
allocating extra pages on other nodes with sufficient available contiguous
memory, if any."
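
For reference, the round robin selection described above boils down to
roughly the following. This is a simplified sketch from my reading of
mm/hugetlb.c, not the exact macro text:

	/*
	 * Hand out h->next_nid_to_alloc, advance it round robin through
	 * nodes_allowed, and give each allowed node at most one attempt
	 * per call to alloc_pool_huge_page().
	 */
	for (nr_nodes = nodes_weight(*nodes_allowed); nr_nodes > 0; nr_nodes--) {
		node = h->next_nid_to_alloc;
		h->next_nid_to_alloc = next_node_in(node, *nodes_allowed);

		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
		if (page)
			break;
	}

If I am reading it right, next_nid_to_alloc advances on every attempt, so a
call that fails on node 0 and then succeeds on node 1 leaves
next_nid_to_alloc pointing back at node 0 for the following call.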

However, consider the case of a 2 node system where:
node 0 has 2GB memory
node 1 has 4GB memory

Now, if one wants to allocate 4GB of huge pages they may be tempted to simply
"echo 2048 > nr_hugepages". At first this will go well, until node 0 is out
of memory. When that happens, alloc_pool_huge_page() will continue to be
called, and because of the for_each_node_mask_to_alloc() macro each call will
likely try to allocate a page from node 0 first. It will run direct reclaim
and compaction on node 0 until the allocation fails, and only then will it
successfully allocate from node 1.
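
To put rough numbers on that (assuming 2MB huge pages and ignoring other
memory use on the nodes):

	2048 huge pages * 2MB = 4GB requested
	node 0: 2GB -> at most ~1024 huge pages before it is exhausted
	node 1: 4GB -> has to supply the remaining ~1024

So roughly the second half of the allocations each go through a full round of
direct reclaim and compaction on node 0 before falling back to node 1.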

In our distro kernel, I am thinking about making allocations try "less hard"
on nodes where we start to see failures (less hard == NORETRY and no direct
reclaim).
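
Something along these lines is what I have in mind. This is only a rough
sketch; the node_alloc_noretry mask for remembering nodes that have already
failed is illustrative, not existing code:

	/*
	 * Sketch only: once a fresh huge page allocation has failed on a
	 * node, try "less hard" on that node from then on: no retries and
	 * no direct reclaim.
	 */
	static struct page *alloc_fresh_huge_page_less_hard(struct hstate *h,
			gfp_t gfp_mask, int nid, nodemask_t *nodes_allowed,
			nodemask_t *node_alloc_noretry)
	{
		struct page *page;

		if (node_isset(nid, *node_alloc_noretry)) {
			gfp_mask |= __GFP_NORETRY;
			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
		}

		page = alloc_fresh_huge_page(h, gfp_mask, nid, nodes_allowed);
		if (!page)
			node_set(nid, *node_alloc_noretry);

		return page;
	}
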
I was going to try something like this on an upstream kernel when I noticed
that direct reclaim may effectively never end/exit. It 'may' exit eventually,
but I instrumented __alloc_pages_slowpath() and saw it spend well over an
hour on a single allocation before I 'tricked' it into exiting.

[ 5916.248341] hpage_slow_alloc: jiffies 5295742 tries 2 node 0 success
[ 5916.249271] reclaim 5295741 compact 1

This is where it stalled after "echo 4096 > nr_hugepages" on a little VM
with 8GB total memory.

I have not started looking at the direct reclaim code to see exactly where we
may be stuck, or just trying really hard. My question is: is this expected,
or should direct reclaim be somewhat bounded? With __alloc_pages_slowpath()
getting 'stuck' in direct reclaim, the documented behavior for huge page
allocation is not going to happen.
--
Mike Kravetz