Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

From: David Rientjes
Date: Tue Oct 01 2019 - 16:32:19 EST


On Tue, 1 Oct 2019, Vlastimil Babka wrote:

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ae967bcf954..2c48146f3ee2 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> 		nmask = policy_nodemask(gfp, pol);
> 		if (!nmask || node_isset(hpage_node, *nmask)) {
> 			mpol_cond_put(pol);
> +			/*
> +			 * First, try to allocate THP only on local node, but
> +			 * don't reclaim unnecessarily, just compact.
> +			 */
> 			page = __alloc_pages_node(hpage_node,
> -				gfp | __GFP_THISNODE, order);
> +				gfp | __GFP_THISNODE | __GFP_NORETRY, order);

The page allocator has heuristics to determine when compaction should be
retried, when reclaim should be retried, and when the allocation itself
should be retried for high-order allocations (see PAGE_ALLOC_COSTLY_ORDER).
PAGE_ALLOC_COSTLY_ORDER exists solely to avoid poor allocator behavior
when reclaim is unlikely to make that amount of contiguous memory
available, or would be too disruptive in doing so.
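
To be concrete about what that heuristic looks like, the gate in the slow
path is roughly the following shape (a simplified sketch against the
definitions in linux/gfp.h, not the actual mm/page_alloc.c code; the
function name is made up):

	static bool keep_retrying(gfp_t gfp_mask, unsigned int order)
	{
		/*
		 * Orders above PAGE_ALLOC_COSTLY_ORDER (order 3) are only
		 * retried if the caller explicitly opted in with
		 * __GFP_RETRY_MAYFAIL; otherwise reclaim and compaction
		 * get one pass and the allocation is allowed to fail.
		 */
		if (order > PAGE_ALLOC_COSTLY_ORDER &&
		    !(gfp_mask & __GFP_RETRY_MAYFAIL))
			return false;

		return true;
	}

The point is that this already applies to every high-order caller, not
just thp.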

Rather than papering over the poor feedback loop between compaction and
reclaim that exists in the page allocator, it's probably better to improve
that feedback loop and determine when an allocation should fail and when
it is worthwhile to retry. That's a separate topic from the NUMA locality
of thp.

In other words, we should likely address how compaction and reclaim are
done for all high-order allocations in the page allocator itself, rather
than only here for hugepage allocations and relying on specific gfp bits
being set. Ask: if the allocation here should not be retried regardless of
why compaction failed, why should any other high-order allocation (user or
kernel) be retried when compaction fails, and at what order should we
simply fail? If hugetlb wants to stress this to the fullest extent
possible, it already appropriately uses __GFP_RETRY_MAYFAIL.
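
From the caller's side that opt-in is just a gfp bit, along these lines
(illustrative only, not hugetlb's actual allocation path; "order" stands
for some costly order):

	/* Insistent: retry reclaim/compaction hard, but may still fail. */
	page = alloc_pages(GFP_KERNEL | __GFP_RETRY_MAYFAIL, order);

versus

	/* Opportunistic: one attempt, fail fast if compaction can't help. */
	page = alloc_pages(GFP_KERNEL | __GFP_NORETRY, order);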

The code here is saying that we care more about NUMA locality than about
getting a hugepage, simply because local access latency is better; that
policy is specific to hugepages, whereas allocation behavior for
high-order pages in general needs to live in the page allocator.

>
> 			/*
> -			 * If hugepage allocations are configured to always
> -			 * synchronous compact or the vma has been madvised
> -			 * to prefer hugepage backing, retry allowing remote
> -			 * memory as well.
> +			 * If that fails, allow both compaction and reclaim,
> +			 * but on all nodes.
> 			 */
> -			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> +			if (!page)
> 				page = __alloc_pages_node(hpage_node,
> -								gfp | __GFP_NORETRY, order);
> +								gfp, order);
>
> 			goto out;
> 		}

We certainly don't want this for non-MADV_HUGEPAGE regions when the thp
enabled bit is not set to "always". We'd much rather fall back to local
native pages because of their improved access latency. This change allows
all hugepage allocations to go remote even without MADV_HUGEPAGE, which is
not even what Andrea needs: qemu uses MADV_HUGEPAGE.
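
For reference, the userspace opt-in that qemu relies on is per-region
advice, roughly ("addr" and "len" are placeholders):

	#include <sys/mman.h>

	/* Ask the kernel to prefer, and work harder for, THP in this range. */
	if (madvise(addr, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

Regions that have not done this should keep falling back to local native
pages rather than paying remote access latency for a hugepage.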