Re: [PATCH 5/5] mm/page_alloc: Limit number of high-order pages on PCP during bulk free

From: Vlastimil Babka
Date: Wed Feb 16 2022 - 09:37:58 EST


On 2/15/22 15:51, Mel Gorman wrote:
> When a PCP is mostly used for frees, high-order pages can sit on PCP
> lists for some time. This is problematic when the allocation pattern is
> all allocations from one CPU and all frees from another, resulting in
> colder pages being used. When bulk freeing pages, limit the number of
> high-order pages that are stored on the PCP lists.
>
> Netperf running on localhost exhibits this pattern and while it does
> not matter for some machines, it does matter for others with smaller
> caches where cache misses cause problems due to reduced page reuse.
> Pages freed directly to the buddy list may be reused quickly while still
> cache hot, whereas pages stored on the PCP lists may be cold by the time
> free_pcppages_bulk() is called.
>
> Using the perf kmem:mm_page_alloc tracepoint, the 5 most used page
> frames were
>
> 5.17-rc3
> 13041 pfn=0x111a30
> 13081 pfn=0x5814d0
> 13097 pfn=0x108258
> 13121 pfn=0x689598
> 13128 pfn=0x5814d8
>
> 5.17-revert-highpcp
> 192009 pfn=0x54c140
> 195426 pfn=0x1081d0
> 200908 pfn=0x61c808
> 243515 pfn=0xa9dc20
> 402523 pfn=0x222bb8
>
> 5.17-full-series
> 142693 pfn=0x346208
> 162227 pfn=0x13bf08
> 166413 pfn=0x2711e0
> 166950 pfn=0x2702f8
>
> The spread is wider because there is still a delay before pages freed
> to one PCP are released, reflecting the trade-off between fast reuse
> and reduced zone lock acquisition.
>
> From the machine used to gather the traces, the headline performance
> was equivalent.
>
> netperf-tcp
>                           5.17.0-rc3              5.17.0-rc3              5.17.0-rc3
>                              vanilla   mm-reverthighpcp-v1r1   mm-highpcplimit-v1r12
> Hmean     64        839.93 (  0.00%)        840.77 (  0.10%)        835.34 * -0.55%*
> Hmean    128       1614.22 (  0.00%)       1622.07 *  0.49%*       1604.18 * -0.62%*
> Hmean    256       2952.00 (  0.00%)       2953.19 (  0.04%)       2959.46 (  0.25%)
> Hmean   1024      10291.67 (  0.00%)      10239.17 ( -0.51%)      10287.05 ( -0.04%)
> Hmean   2048      17335.08 (  0.00%)      17399.97 (  0.37%)      17125.73 * -1.21%*
> Hmean   3312      22628.15 (  0.00%)      22471.97 ( -0.69%)      22414.24 * -0.95%*
> Hmean   4096      25009.50 (  0.00%)      24752.83 * -1.03%*      24620.03 * -1.56%*
> Hmean   8192      32745.01 (  0.00%)      31682.63 * -3.24%*      32475.31 ( -0.82%)
> Hmean  16384      39759.59 (  0.00%)      36805.78 * -7.43%*      39291.42 ( -1.18%)
>
> From a 1-socket Skylake machine with a small CPU cache that suffers
> more when cache misses are high
>
> netperf-tcp
>                           5.17.0-rc3              5.17.0-rc3              5.17.0-rc3
>                              vanilla     mm-reverthighpcp-v1      mm-highpcplimit-v1
> Min       64        935.38 (  0.00%)        939.40 (  0.43%)        940.11 (  0.51%)
> Min      128       1831.69 (  0.00%)       1856.15 (  1.34%)       1849.30 (  0.96%)
> Min      256       3560.61 (  0.00%)       3659.25 (  2.77%)       3654.12 (  2.63%)
> Min     1024      13165.24 (  0.00%)      13444.74 (  2.12%)      13281.71 (  0.88%)
> Min     2048      22706.44 (  0.00%)      23219.67 (  2.26%)      23027.31 (  1.41%)
> Min     3312      30960.26 (  0.00%)      31985.01 (  3.31%)      31484.40 (  1.69%)
> Min     4096      35149.03 (  0.00%)      35997.44 (  2.41%)      35891.92 (  2.11%)
> Min     8192      48064.73 (  0.00%)      49574.05 (  3.14%)      48928.89 (  1.80%)
> Min    16384      58017.25 (  0.00%)      60352.93 (  4.03%)      60691.14 (  4.61%)
> Hmean     64        938.95 (  0.00%)        941.50 *  0.27%*        940.47 (  0.16%)
> Hmean    128       1843.10 (  0.00%)       1857.58 *  0.79%*       1855.83 *  0.69%*
> Hmean    256       3573.07 (  0.00%)       3667.45 *  2.64%*       3662.08 *  2.49%*
> Hmean   1024      13206.52 (  0.00%)      13487.80 *  2.13%*      13351.11 *  1.09%*
> Hmean   2048      22870.23 (  0.00%)      23337.96 *  2.05%*      23149.68 *  1.22%*
> Hmean   3312      31001.99 (  0.00%)      32206.50 *  3.89%*      31849.40 *  2.73%*
> Hmean   4096      35364.59 (  0.00%)      36490.96 *  3.19%*      36112.91 *  2.12%*
> Hmean   8192      48497.71 (  0.00%)      49954.05 *  3.00%*      49384.50 *  1.83%*
> Hmean  16384      58410.86 (  0.00%)      60839.80 *  4.16%*      61362.12 *  5.05%*
>
> Note that this was a machine that did not benefit from caching high-order
> pages and that performance is almost restored with the series applied.
> It is not fully restored as cache misses are still higher. This is a
> trade-off between optimising for a workload that does all allocations on
> one CPU and all frees on another, and more general workloads that need
> high-order pages for SLUB and benefit from avoiding zone->lock on every
> SLUB refill/drain.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

Reviewed-by: Vlastimil Babka <vbabka@xxxxxxx>

> ---
> mm/page_alloc.c | 26 +++++++++++++++++++++-----
> 1 file changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6881175b27df..cfb3cbad152c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3314,10 +3314,15 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
> return true;
> }
>
> -static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
> +static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch,
> + bool free_high)
> {
> int min_nr_free, max_nr_free;
>
> + /* Free everything if batch freeing high-order pages. */
> + if (unlikely(free_high))
> + return pcp->count;
> +
> /* Check for PCP disabled or boot pageset */
> if (unlikely(high < batch))
> return 1;
> @@ -3338,11 +3343,12 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
> return batch;
> }
>
> -static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
> +static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
> + bool free_high)
> {
> int high = READ_ONCE(pcp->high);
>
> - if (unlikely(!high))
> + if (unlikely(!high || free_high))
> return 0;
>
> if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
> @@ -3362,17 +3368,27 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn,
> struct per_cpu_pages *pcp;
> int high;
> int pindex;
> + bool free_high;
>
> __count_vm_event(PGFREE);
> pcp = this_cpu_ptr(zone->per_cpu_pageset);
> pindex = order_to_pindex(migratetype, order);
> list_add(&page->lru, &pcp->lists[pindex]);
> pcp->count += 1 << order;
> - high = nr_pcp_high(pcp, zone);
> +
> + /*
> + * As high-order pages other than THP's stored on PCP can contribute
> + * to fragmentation, limit the number stored when PCP is heavily
> + * freeing without allocation. The remainder after bulk freeing
> + * stops will be drained from vmstat refresh context.
> + */
> + free_high = (pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER);
> +
> + high = nr_pcp_high(pcp, zone, free_high);
> if (pcp->count >= high) {
> int batch = READ_ONCE(pcp->batch);
>
> - free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp, pindex);
> + free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch, free_high), pcp, pindex);
> }
> }
>
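
For readers following along, here is a minimal userspace sketch of the
decision this patch adds. It is not the kernel code: the names are borrowed
from the patch but the structures are simplified, the constants are invented
for the example, and the free_factor scaling/clamping in the real
nr_pcp_free() as well as the ZONE_RECLAIM_ACTIVE check in nr_pcp_high() are
omitted. It only illustrates the policy that a non-THP high-order free on a
free-heavy PCP drains the whole per-cpu list instead of honouring the normal
high watermark.

/*
 * Simplified userspace model of the free_high policy, not kernel code.
 * Constants below are made up for the example.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * free_high: the PCP has recently been free-heavy (free_factor != 0) and
 * the page being freed is a non-zero order up to PAGE_ALLOC_COSTLY_ORDER.
 */
static bool is_free_high(int free_factor, unsigned int order)
{
	return free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER;
}

/* models nr_pcp_high(): free_high makes any non-empty PCP exceed the mark */
static int model_nr_pcp_high(int high, bool free_high)
{
	return free_high ? 0 : high;
}

/* models nr_pcp_free(): free_high drains everything, else free one batch */
static int model_nr_pcp_free(int count, int high, int batch, bool free_high)
{
	if (free_high)
		return count;
	if (high < batch)
		return 1;	/* PCP disabled or boot pageset */
	return batch;		/* simplified: no free_factor scaling/clamping */
}

int main(void)
{
	int count = 96, high = 128, batch = 63, free_factor = 1;
	unsigned int orders[] = { 0, 3 };

	for (int i = 0; i < 2; i++) {
		bool fh = is_free_high(free_factor, orders[i]);
		int wm = model_nr_pcp_high(high, fh);
		int to_free = count >= wm ?
			model_nr_pcp_free(count, high, batch, fh) : 0;

		printf("order-%u free: free_high=%d, bulk free %d of %d pages\n",
		       orders[i], fh, to_free, count);
	}
	return 0;
}

With count below pcp->high, the order-0 free leaves the list alone, while
the order-3 free on a free-heavy PCP drains all 96 pages, which matches the
behaviour described in the changelog.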