Re: [PATCHv6 1/1] mm: optimization on page allocation when CMA enabled

From: Zhaoyang Huang
Date: Wed Nov 08 2023 - 03:55:42 EST


On Wed, Nov 8, 2023 at 1:28 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Mon, Oct 16, 2023 at 03:12:45PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> >
> > According to current CMA utilization policy, an alloc_pages(GFP_USER)
> > could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of
> > CMA(pass zone_watermark_ok by counting CMA in but use U&R in rmqueue),
> > which could lead to following alloc_pages(GFP_KERNEL) fail.
> > Solving this by introducing second watermark checking for GFP_MOVABLE,
> > which could have the allocation use CMA when proper.
> >
> > -- Free_pages(30MB)
> > |
> > |
> > -- WMARK_LOW(25MB)
> > |
> > -- Free_CMA(12MB)
> > |
> > |
> > --
>
> We're running into the same issue in production and had an incident
> over the weekend because of it. The hosts have a raised
> vm.min_free_kbytes for network rx reliability, which makes the
> mismatch between free pages and what's actually allocatable by regular
> kernel requests quite pronounced. It wasn't OOMing this time, but we
> saw very high rates of thrashing while CMA had plenty of headroom.
>
> I had raised the broader issue around poor CMA utilization before:
> https://lore.kernel.org/lkml/20230726145304.1319046-1-hannes@xxxxxxxxxxx/
>
> For context, we're using hugetlb_cma at several gigabytes to allow
> sharing hosts between jobs that use hugetlb and jobs that don't.
>
> > @@ -2078,6 +2078,43 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> >
> > }
> >
> > +#ifdef CONFIG_CMA
> > +/*
> > + * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via
> > + * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok
> > + * again without ALLOC_CMA to see if to use CMA first.
> > + */
> > +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags)
> > +{
> > + unsigned long watermark;
> > + bool cma_first = false;
> > +
> > + watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> > + /* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */
> > + if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) {
> > + /*
> > + * Balance movable allocations between regular and CMA areas by
> > + * allocating from CMA when over half of the zone's free memory
> > + * is in the CMA area.
> > + */
ok, thanks for point out.
Could we simple it like this, which will mis-judge kmalloc within
ioctl as GFP_USER. I think it is ok as it is rare
if (current_is_kswapd() || !current->mm)
gfp_flags = GFP_KERNEL;
else
gfp_flags = GFP_USER;
free_pages = zone_page_state(zone, NR_FREE_PAGES);
free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
cma_first = free_pages > zone_page_state(zone, NR_FREE_PAGES) / 2);

> > + cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) >
> > + zone_page_state(zone, NR_FREE_PAGES) / 2);
> > + } else {
> > + /*
> > + * watermark failed means UNMOVABLE & RECLAIMBLE is not enough
> > + * now, we should use cma first to keep them stay around the
> > + * corresponding watermark
> > + */
> > + cma_first = true;
> > + }
> > + return cma_first;
>
> I think it's a step in the right direction. However, it doesn't take
> the lowmem reserves into account. With DMA32 that can be an additional
> multiple gigabytes of "free" memory not available to GFP_KERNEL. It
> also has a knee in the balancing curve because it doesn't take
> reserves into account *until* non-CMA is depleted - at which point it
> would already be below the use-CMA threshold by the full reserves and
> watermarks.
>
> A more complete solution would have to plumb the highest_zoneidx
> information through the rmqueue family of functions somehow, and
> always take unavailable free memory into account:
>
> ---
> Subject: [PATCH] mm: page_alloc: use CMA when kernel allocations are beginning
> to fail
>
> We can get into a situation where kernel allocations are starting to
> fail on watermarks, but movable allocations still don't use CMA
> because they make up more than half of the free memory. This can
> happen in particular with elevated vm.min_free_kbytes settings, where
> the remaining free pages aren't available to non-atomic requests.
>
> Example scenario:
>
> Free: 3.0G
> Watermarks: 2.0G
> CMA: 1.4G
> -> non-CMA: 1.6G
>
> CMA isn't used because CMA <= free/2. Kernel allocations fail due to
> non-CMA < watermarks. If memory is mostly unreclaimable (e.g. anon
> without swap), the kernel is more likely to OOM prematurely.
>
> Reduce the probability of that happening by taking reserves and
> watermarks into account when deciding whether to start using CMA.
>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
> mm/page_alloc.c | 93 +++++++++++++++++++++++++++++++------------------
> 1 file changed, 59 insertions(+), 34 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 733732e7e0ba..b9273d7f23b8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2079,30 +2079,52 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
>
> }
>
> +static bool should_try_cma(struct zone *zone, unsigned int order,
> + gfp_t gfp_flags, unsigned int alloc_flags)
> +{
> + long free_pages;
> +
> + if (!IS_ENABLED(CONFIG_CMA) || !(alloc_flags & ALLOC_CMA))
> + return false;
> +
> + /*
> + * CMA regions can be used by movable allocations while
> + * they're not otherwise in use. This is a delicate balance:
> + * Filling CMA too soon poses a latency risk for actual CMA
> + * allocations (think camera app startup). Filling CMA too
> + * late risks premature OOMs from non-movable allocations.
> + *
> + * Start using CMA once it dominates the remaining free
> + * memory. Be sure to take watermarks and reserves into
> + * account when considering what's truly "free".
> + *
> + * free_pages can go negative, but that's okay because
> + * NR_FREE_CMA_PAGES should not.
> + */
> +
> + free_pages = zone_page_state(zone, NR_FREE_PAGES);
> + free_pages -= zone->lowmem_reserve[gfp_zone(gfp_flags)];
> + free_pages -= wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> +
> + return zone_page_state(zone, NR_FREE_CMA_PAGES) > free_pages / 2;
> +}
> +
> /*
> * Do the hard work of removing an element from the buddy allocator.
> * Call me with the zone->lock already held.
> */
> static __always_inline struct page *
> -__rmqueue(struct zone *zone, unsigned int order, int migratetype,
> - unsigned int alloc_flags)
> +__rmqueue(struct zone *zone, unsigned int order, gfp_t gfp_flags,
> + int migratetype, unsigned int alloc_flags)
> {
> struct page *page;
>
> - if (IS_ENABLED(CONFIG_CMA)) {
> - /*
> - * Balance movable allocations between regular and CMA areas by
> - * allocating from CMA when over half of the zone's free memory
> - * is in the CMA area.
> - */
> - if (alloc_flags & ALLOC_CMA &&
> - zone_page_state(zone, NR_FREE_CMA_PAGES) >
> - zone_page_state(zone, NR_FREE_PAGES) / 2) {
> - page = __rmqueue_cma_fallback(zone, order);
> - if (page)
> - return page;
> - }
> + if (should_try_cma(zone, order, gfp_flags, alloc_flags)) {
> + page = __rmqueue_cma_fallback(zone, order);
> + if (page)
> + return page;
> }
> +
> retry:
> page = __rmqueue_smallest(zone, order, migratetype);
> if (unlikely(!page)) {
> @@ -2121,7 +2143,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
> * a single hold of the lock, for efficiency. Add them to the supplied list.
> * Returns the number of new pages which were placed at *list.
> */
> -static int rmqueue_bulk(struct zone *zone, unsigned int order,
> +static int rmqueue_bulk(struct zone *zone, unsigned int order, gfp_t gfp_flags,
> unsigned long count, struct list_head *list,
> int migratetype, unsigned int alloc_flags)
> {
> @@ -2130,8 +2152,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>
> spin_lock_irqsave(&zone->lock, flags);
> for (i = 0; i < count; ++i) {
> - struct page *page = __rmqueue(zone, order, migratetype,
> - alloc_flags);
> + struct page *page = __rmqueue(zone, order, gfp_flags,
> + migratetype, alloc_flags);
> if (unlikely(page == NULL))
> break;
>
> @@ -2714,8 +2736,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
>
> static __always_inline
> struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
> - unsigned int order, unsigned int alloc_flags,
> - int migratetype)
> + unsigned int order, gfp_t gfp_flags,
> + unsigned int alloc_flags, int migratetype)
> {
> struct page *page;
> unsigned long flags;
> @@ -2726,7 +2748,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
> if (alloc_flags & ALLOC_HIGHATOMIC)
> page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> if (!page) {
> - page = __rmqueue(zone, order, migratetype, alloc_flags);
> + page = __rmqueue(zone, order, migratetype,
> + gfp_flags, alloc_flags);
>
> /*
> * If the allocation fails, allow OOM handling access
> @@ -2806,10 +2829,10 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
> /* Remove page from the per-cpu list, caller must protect the list */
> static inline
> struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
> - int migratetype,
> - unsigned int alloc_flags,
> - struct per_cpu_pages *pcp,
> - struct list_head *list)
> + gfp_t gfp_flags, int migratetype,
> + unsigned int alloc_flags,
> + struct per_cpu_pages *pcp,
> + struct list_head *list)
> {
> struct page *page;
>
> @@ -2818,7 +2841,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
> int batch = nr_pcp_alloc(pcp, zone, order);
> int alloced;
>
> - alloced = rmqueue_bulk(zone, order,
> + alloced = rmqueue_bulk(zone, order, gfp_flags,
> batch, list,
> migratetype, alloc_flags);
>
> @@ -2837,8 +2860,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>
> /* Lock and remove page from the per-cpu list */
> static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> - struct zone *zone, unsigned int order,
> - int migratetype, unsigned int alloc_flags)
> + struct zone *zone, unsigned int order,
> + gfp_t gfp_flags, int migratetype,
> + unsigned int alloc_flags)
> {
> struct per_cpu_pages *pcp;
> struct list_head *list;
> @@ -2860,7 +2884,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> */
> pcp->free_count >>= 1;
> list = &pcp->lists[order_to_pindex(migratetype, order)];
> - page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
> + page = __rmqueue_pcplist(zone, order, gfp_flags, migratetype,
> + alloc_flags, pcp, list);
> pcp_spin_unlock(pcp);
> pcp_trylock_finish(UP_flags);
> if (page) {
> @@ -2898,13 +2923,13 @@ struct page *rmqueue(struct zone *preferred_zone,
>
> if (likely(pcp_allowed_order(order))) {
> page = rmqueue_pcplist(preferred_zone, zone, order,
> - migratetype, alloc_flags);
> + gfp_flags, migratetype, alloc_flags);
> if (likely(page))
> goto out;
> }
>
> - page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> - migratetype);
> + page = rmqueue_buddy(preferred_zone, zone, order, gfp_flags,
> + alloc_flags, migratetype);
>
> out:
> /* Separate test+clear to avoid unnecessary atomics */
> @@ -4480,8 +4505,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> continue;
> }
>
> - page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
> - pcp, pcp_list);
> + page = __rmqueue_pcplist(zone, 0, gfp, ac.migratetype,
> + alloc_flags, pcp, pcp_list);
> if (unlikely(!page)) {
> /* Try and allocate at least one page */
> if (!nr_account) {
> --
> 2.42.0

>