Re: [PATCH] MM: discard __GFP_ATOMIC

From: Mel Gorman
Date: Tue Oct 18 2022 - 08:11:13 EST


On Sun, Oct 16, 2022 at 07:38:27PM -0700, Andrew Morton wrote:
> On Wed, 7 Sep 2022 10:47:24 +0100 Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> > On Tue, Sep 06, 2022 at 09:35:41AM +0200, Michal Hocko wrote:
> > > > From: "NeilBrown" <neilb@xxxxxxx>
> > > > Subject: mm: discard __GFP_ATOMIC
> > > >
> > > > __GFP_ATOMIC serves little purpose. Its main effect is to set
> > > > ALLOC_HARDER which adds a few little boosts to increase the chance of an
> > > > allocation succeeding, one of which is to lower the water-mark at which it
> > > > will succeed.
> > > >
> > > > It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
> > > > adjusts this watermark. It is probable that other users of __GFP_HIGH
> > > > should benefit from the other little bonuses that __GFP_ATOMIC gets.
> > > >
> > > > __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
> > > > There is little point to this. We already get a might_sleep() warning if
> > > > __GFP_DIRECT_RECLAIM is set.
> > > >
> > > > __GFP_ATOMIC allows the "watermark_boost" to be side-stepped. It is
> > > > probable that testing ALLOC_HARDER is a better fit here.
> > > >
> > > > __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
> > > > sleep. This should test __GFP_DIRECT_RECLAIM instead.
> > > >
> > > > This patch:
> > > > - removes __GFP_ATOMIC
> > > > - causes __GFP_HIGH to set ALLOC_HARDER unless __GFP_NOMEMALLOC is set
> > > > (as well as ALLOC_HIGH).
> > > > - makes other adjustments as suggested by the above.
> > > >
> > > > The net result is not change to GFP_ATOMIC allocations. Other
> > > > allocations that use __GFP_HIGH will benefit from a few different extra
> > > > privileges. This affects:
> > > > xen, dm, md, ntfs3
> > > > the vermillion frame buffer
> > > > hibernation
> > > > ksm
> > > > swap
> > > > all of which likely produce more benefit than cost if these selected
> > > > allocation are more likely to succeed quickly.
> > >
> > > This is a good summary of the current usage and existing issues. It also
> > > shows that the naming is tricky and allows people to make wrong calls
> > > (tegra-smmu.c). I also thing that it is wrong to couple memory reserves
> > > access to the reclaim constrains/expectations of the caller.
> > >
> >
> > I think it's worth trying to get rid of __GFP_ATOMIC although this patch
> > needs to be rebased. Without rebasing it, I suspect there is a corner case
> > for reserving high order atomic blocks. A high-order atomic allocation
> > might get confused with a __GFP_HIGH high-order allocation that can sleep.
> > It would not be completely irrational to have such a caller if it was in a
> > path that can tolerate a stall but stalling might have visible consequences.
> > I'm also worried that the patch might allow __GFP_HIGH to ignore cpusets
> > which is probably not intended by direct users like ksm.
>
> Unclear what you mean by "rebased".

I was looking at the original patch, not what was in mm-unstable.

> You're saying the patch might have
> issues - doesn't that mean it needs to be "fixed"?
>

Yes, I think it needs some follow-on work.

> Anyway, I've been maintaining this change for nearly a year - if
> nothing happens soon I guess I'll drop it so it doesn't get in people's
> way.

I think it's easier to highlight my concerns by renaming ALLOC_HARDER
to ALLOC_MIN_RESERVE because that's effectively what it means. It then
becomes a bit more obvious where the problems might be.

They're all fixable but do people agree that these problems are real or
my imagination?

---
mm/internal.h | 2 +-
mm/page_alloc.c | 51 ++++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 4b44ced87fff..8c7fd034b277 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -743,7 +743,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_OOM ALLOC_NO_WATERMARKS
#endif

-#define ALLOC_HARDER 0x10 /* try to alloc harder */
+#define ALLOC_MIN_RESERVE 0x10 /* allow allocations below min watermark */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1a9a844dc197..9ff4cd4c5d11 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3708,8 +3708,16 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
* due to non-CMA allocation context. HIGHATOMIC area is
* reserved for high-order atomic allocation, so order-0
* request should skip it.
+ *
+ * BZZT: __GFP_HIGH sets ALLOC_MIN_RESERVE and while __GFP_HIGH
+ * is set for GFP_ATOMIC, it does not mean that all
+ * __GFP_HIGH allocations also do not specify
+ * _GFP_DIRECT_RECLAIM. While there do not appear to
+ * be any in-tree users that allocation high-order pages
+ * using __GFP_HIGH && !__GFP_DIRECT_RECLAIM, it's a
+ * potential issue.
*/
- if (order > 0 && alloc_flags & ALLOC_HARDER)
+ if (order > 0 && alloc_flags & ALLOC_MIN_RESERVE)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
if (!page) {
page = __rmqueue(zone, order, migratetype, alloc_flags);
@@ -3943,11 +3951,11 @@ ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
static inline long __zone_watermark_unusable_free(struct zone *z,
unsigned int order, unsigned int alloc_flags)
{
- const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
+ const bool alloc_harder = (alloc_flags & (ALLOC_MIN_RESERVE|ALLOC_OOM));
long unusable_free = (1 << order) - 1;

/*
- * If the caller does not have rights to ALLOC_HARDER then subtract
+ * If the caller does not have rights to ALLOC_MIN_RESERVE then subtract
* the high-atomic reserves. This will over-estimate the size of the
* atomic reserve but it avoids a search.
*/
@@ -3975,17 +3983,21 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
{
long min = mark;
int o;
- const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
+ const bool alloc_harder = (alloc_flags & (ALLOC_MIN_RESERVE|ALLOC_OOM));

/* free_pages may go negative - that's OK */
free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);

+ /*
+ * BZZT: ALLOC_HIGH is now effectively an alias of ALLOC_MIN_RESERVE
+ * so we end up double dipping below.
+ */
if (alloc_flags & ALLOC_HIGH)
min -= min / 2;

if (unlikely(alloc_harder)) {
/*
- * OOM victims can try even harder than normal ALLOC_HARDER
+ * OOM victims can try even harder than normal ALLOC_MIN_RESERVE
* users on the grounds that it's definitely going to be in
* the exit path shortly and free memory. Any allocation it
* makes during the free path will be small and short-lived.
@@ -4027,6 +4039,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
return true;
}
#endif
+ /* BZZT: ALLOC_MIN_RESERVE does not imply access to highatomic reserves */
if (alloc_harder && !free_area_empty(area, MIGRATE_HIGHATOMIC))
return true;
}
@@ -4069,12 +4082,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
free_pages))
return true;
/*
- * Ignore watermark boosting for GFP_HIGH order-0 allocations
+ * Ignore watermark boosting for __GFP_HIGH order-0 allocations
* when checking the min watermark. The min watermark is the
* point where boosting is ignored so that kswapd is woken up
* when below the low watermark.
*/
- if (unlikely(!order && (alloc_flags & ALLOC_HARDER) && z->watermark_boost
+ if (unlikely(!order && (alloc_flags & ALLOC_MIN_RESERVE) && z->watermark_boost
&& ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
mark = z->_watermark[WMARK_MIN];
return __zone_watermark_ok(z, order, mark, highest_zoneidx,
@@ -4289,8 +4302,10 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
/*
* If this is a high-order atomic allocation then check
* if the pageblock should be reserved for the future
+ *
+ * BZZT: __GFP_HIGH does not imply high-order atomic
*/
- if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
+ if (unlikely(order && (alloc_flags & ALLOC_MIN_RESERVE)))
reserve_highatomic_pageblock(page, zone, order);

return page;
@@ -4834,6 +4849,10 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
* cannot run direct reclaim, or if the caller has realtime scheduling
* policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
* set both ALLOC_HARDER (unless __GFP_NOMEMALLOC) and ALLOC_HIGH.
+ *
+ * BZZT: Minor conflation issue. __GFP_HIGH == ALLOC_HIGH but they are
+ * now very similar to each other. Probably should remove
+ * ALLOC_HIGH and modify __GFP_HIGH == ALLOC_MIN_RESERVE
*/
alloc_flags |= (__force int)
(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
@@ -4844,14 +4863,24 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
* if it can't schedule.
*/
if (!(gfp_mask & __GFP_NOMEMALLOC))
- alloc_flags |= ALLOC_HARDER;
+ alloc_flags |= ALLOC_MIN_RESERVE;
/*
* Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
* comment for __cpuset_node_allowed().
+ *
+ * BZZT: Not necessarily "atomic", should have checked
+ * __GFP_DIRECT_RECLAIM? This is a matter of
+ * definition as __GFP_HIGH is documented as being
+ * necessary for the system to make progress.
+ * Either enforce cpusets or update __GFP_HIGH
+ * documentation stating that CPUSETS may be
+ * broken. Based on the current __GFP_HIGH
+ * direct users, documenting that cpusets can
+ * be broken appears to be the safest option.
*/
alloc_flags &= ~ALLOC_CPUSET;
} else if (unlikely(rt_task(current)) && in_task())
- alloc_flags |= ALLOC_HARDER;
+ alloc_flags |= ALLOC_MIN_RESERVE;

alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);

@@ -5277,7 +5306,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* could deplete whole memory reserves which would just make
* the situation worse
*/
- page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
+ page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE, ac);
if (page)
goto got_pg;