Re: [Question] Should direct reclaim time be bounded?

From: Mike Kravetz
Date: Wed Jul 03 2019 - 19:55:23 EST


On 7/3/19 2:43 AM, Mel Gorman wrote:
> Indeed. I'm getting knocked offline shortly so I didn't give this the
> time it deserves, but it appears that part of this problem is
> hugetlb-specific when one node is full and can enter into this continual
> loop due to __GFP_RETRY_MAYFAIL requiring both nr_reclaimed and
> nr_scanned to be zero.

Yes, I am not aware of any other large-order allocations consistently made
with __GFP_RETRY_MAYFAIL. But I did not look too closely. Michal believes
that hugetlb page allocations should use __GFP_RETRY_MAYFAIL.
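
For context, hugetlb applies the flag when it falls back to the buddy
allocator; in current mm/hugetlb.c this looks roughly like the following
(abridged sketch, so details may differ):

	static struct page *alloc_buddy_huge_page(struct hstate *h,
			gfp_t gfp_mask, int nid, nodemask_t *nmask)
	{
		int order = huge_page_order(h);
		struct page *page;

		/* every fresh pool page allocation tries hard */
		gfp_mask |= __GFP_COMP | __GFP_RETRY_MAYFAIL | __GFP_NOWARN;
		if (nid == NUMA_NO_NODE)
			nid = numa_mem_id();
		page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
		...
	}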

> Have you considered one of the following as an option?
>
> 1. Always use the on-stack nodes_allowed in __nr_hugepages_store_common
> and copy nodes_states if necessary. Add a bool parameter to
> alloc_pool_huge_page that is true when called from set_max_huge_pages.
> If an allocation from alloc_fresh_huge_page fails, clear the failing node
> from the mask so it's not retried, bail if the mask is empty. The
> consequences are that round-robin allocation of huge pages will be
> different if a node failed to allocate for transient reasons.

That seems to be a more aggressive form of option 3 below.
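
For illustration, option 1 might look something like this sketch (the
bool parameter name here is made up):

	/* Sketch only: 'from_set_max' is an illustrative name */
	static int alloc_pool_huge_page(struct hstate *h,
			nodemask_t *nodes_allowed, bool from_set_max)
	{
		gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
		struct page *page = NULL;
		int nr_nodes, node;

		for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
			page = alloc_fresh_huge_page(h, gfp_mask, node,
						     nodes_allowed);
			if (page)
				break;
			if (from_set_max) {
				/* do not retry a node that just failed */
				node_clear(node, *nodes_allowed);
				/* bail if no candidate nodes remain */
				if (nodes_empty(*nodes_allowed))
					break;
			}
		}
		...
	}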

> 2. Alter the condition in should_continue_reclaim for
> __GFP_RETRY_MAYFAIL to consider if nr_scanned < SWAP_CLUSTER_MAX.
> Either raise priority (will interfere with kswapd though) or
> bail entirely. Consequences may be that other __GFP_RETRY_MAYFAIL
> allocations do not want this behaviour. There are a lot of users.

Due to the high number of users, I am avoiding such a change. It would be
hard to validate that such a change does not impact other users.
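
For reference, the condition in question is in should_continue_reclaim()
in mm/vmscan.c; the suggested tweak would be roughly:

	/* Consider stopping depending on scan and reclaim activity */
	if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
		/*
		 * Today reclaim only stops for __GFP_RETRY_MAYFAIL when a
		 * full pass reclaims nothing and scans nothing.  The tweak
		 * would also stop when almost nothing was scanned, which
		 * is the state the hugetlb loop gets stuck in.
		 */
		if (!nr_reclaimed && nr_scanned < SWAP_CLUSTER_MAX)
			return false;
	}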

> 3. Move where __GFP_RETRY_MAYFAIL is set in a gfp_mask in mm/hugetlb.c.
> Strip the flag if an allocation fails on a node. Consequences are
> that setting the required number of huge pages is more likely to
> return without all the huge pages set.

We are actually using a form of this in our distro kernel. It works quite
well on the older (4.11 based) distro kernel. My plan was to push this
upstream. However, when I tested this on recent upstream kernels, I
encountered long stalls associated with the first __GFP_RETRY_MAYFAIL
allocation failure. That is what prompted me to ask this question and start
this thread. The distro kernel would see stalls lasting tens of seconds;
upstream would see stalls of several minutes. Much has changed since 4.11,
so I was trying to figure out what might be causing this change in behavior.

BTW, here is the patch I was testing. It actually has additional code to
switch between __GFP_RETRY_MAYFAIL and __GFP_NORETRY and back again, to
hopefully account for transient conditions.
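
The core idea, as a rough sketch (not the patch itself; 'alloc_try_hard'
is an illustrative name):

	bool alloc_try_hard = true;	/* current allocation mode */
	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_COMP | __GFP_NOWARN;

	if (alloc_try_hard)
		gfp_mask |= __GFP_RETRY_MAYFAIL;	/* try hard */
	else
		gfp_mask |= __GFP_NORETRY;		/* fail fast */

	page = __alloc_pages_nodemask(gfp_mask, huge_page_order(h),
				      nid, nmask);

	/*
	 * A failure while trying hard suggests real, possibly transient,
	 * memory pressure, so fall back to __GFP_NORETRY.  A success in
	 * __GFP_NORETRY mode suggests the pressure has eased, so switch
	 * back to trying hard.
	 */
	if (!page && alloc_try_hard)
		alloc_try_hard = false;
	else if (page && !alloc_try_hard)
		alloc_try_hard = true;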