Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v4

From: Michal Hocko
Date: Thu Dec 01 2016 - 08:42:32 EST


On Thu 01-12-16 00:24:40, Mel Gorman wrote:
> Changelog since v3
> o Allow high-order atomic allocations to use reserves
>
> Changelog since v2
> o Correct initialisation to avoid -Woverflow warning
>
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications:
>
> o New per-cpu lists are added to cache the high-order pages. This increases
> the cache footprint of the per-cpu allocator and overall usage but for
> some workloads, this will be offset by reduced contention on zone->lock.
> The first MIGRATE_PCPTYPES entries in the list are per-migratetype. The
> remaining entries are high-order caches up to and including
> PAGE_ALLOC_COSTLY_ORDER.
>
> o pcp accounting during free is now confined to free_pcppages_bulk as it's
> impossible for the caller to know exactly how many pages were freed.
> Due to the high-order caches, the number of pages drained for a request
> is no longer precise.
>
> o The high watermark for per-cpu pages is increased to reduce the probability
> that a single refill causes a drain on the next free.
>
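To make the list layout and the new watermark concrete for other readers, a
quick illustrative sketch follows. It is not part of the patch and assumes the
common values MIGRATE_PCPTYPES == 3 and PAGE_ALLOC_COSTLY_ORDER == 3 (so
NR_PCP_LISTS == 6) plus an arbitrary sample batch of 31, for which the new
pcp->high works out to 11 * batch rather than the previous 6 * batch:

/*
 * Illustrative only, not part of the patch: assumes the common config
 * values MIGRATE_PCPTYPES == 3 and PAGE_ALLOC_COSTLY_ORDER == 3 and a
 * sample batch of 31.
 */
#include <stdio.h>

#define MIGRATE_PCPTYPES        3       /* unmovable, movable, reclaimable */
#define PAGE_ALLOC_COSTLY_ORDER 3
#define NR_PCP_LISTS    (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)

/* same mapping as order_to_pindex()/pindex_to_order() in the patch */
static unsigned int order_to_pindex(int migratetype, unsigned int order)
{
        return (order == 0) ? migratetype : MIGRATE_PCPTYPES + order - 1;
}

static unsigned int pindex_to_order(unsigned int pindex)
{
        return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
}

int main(void)
{
        unsigned int pindex, batch = 31;

        /* pindex 0-2: order-0 lists, one per migratetype;
         * pindex 3-5: orders 1-3, shared by all migratetypes */
        for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
                printf("pindex %u caches order %u\n",
                       pindex, pindex_to_order(pindex));

        /* e.g. an order-2 free lands on list 4 whatever its migratetype */
        printf("order 2 -> pindex %u\n", order_to_pindex(0, 2));

        /* new high watermark: 3*batch + 8*batch = 11*batch (was 6*batch) */
        printf("pcp->high = %u for batch %u\n",
               batch * MIGRATE_PCPTYPES + (batch << PAGE_ALLOC_COSTLY_ORDER),
               batch);
        return 0;
}
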
> The benefit depends on both the workload and the machine as ultimately the
> determining factor is whether cache line bouncing on zone->lock or lock
> contention is a problem. The patch was tested on a variety of workloads
> and machines, some of which are reported here.
>
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
>
> 2-socket modern machine
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v4
> Hmean send-64 178.38 ( 0.00%) 259.91 ( 45.70%)
> Hmean send-128 351.49 ( 0.00%) 532.46 ( 51.49%)
> Hmean send-256 671.23 ( 0.00%) 1022.99 ( 52.40%)
> Hmean send-1024 2663.60 ( 0.00%) 3995.17 ( 49.99%)
> Hmean send-2048 5126.53 ( 0.00%) 7587.05 ( 48.00%)
> Hmean send-3312 7949.99 ( 0.00%) 11697.16 ( 47.13%)
> Hmean send-4096 9433.56 ( 0.00%) 13188.14 ( 39.80%)
> Hmean send-8192 15940.64 ( 0.00%) 22023.98 ( 38.16%)
> Hmean send-16384 26699.54 ( 0.00%) 32703.27 ( 22.49%)
> Hmean recv-64 178.38 ( 0.00%) 259.89 ( 45.70%)
> Hmean recv-128 351.49 ( 0.00%) 532.43 ( 51.48%)
> Hmean recv-256 671.20 ( 0.00%) 1022.75 ( 52.38%)
> Hmean recv-1024 2663.45 ( 0.00%) 3994.49 ( 49.97%)
> Hmean recv-2048 5126.26 ( 0.00%) 7585.38 ( 47.97%)
> Hmean recv-3312 7949.50 ( 0.00%) 11695.43 ( 47.12%)
> Hmean recv-4096 9433.04 ( 0.00%) 13186.34 ( 39.79%)
> Hmean recv-8192 15939.64 ( 0.00%) 22021.41 ( 38.15%)
> Hmean recv-16384 26698.44 ( 0.00%) 32699.75 ( 22.48%)
>
> 1-socket 6 year old machine
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v4
> Hmean send-64 87.47 ( 0.00%) 129.07 ( 47.56%)
> Hmean send-128 174.36 ( 0.00%) 258.53 ( 48.27%)
> Hmean send-256 347.52 ( 0.00%) 511.02 ( 47.05%)
> Hmean send-1024 1363.03 ( 0.00%) 1990.02 ( 46.00%)
> Hmean send-2048 2632.68 ( 0.00%) 3775.89 ( 43.42%)
> Hmean send-3312 4123.19 ( 0.00%) 5915.77 ( 43.48%)
> Hmean send-4096 5056.48 ( 0.00%) 7152.31 ( 41.45%)
> Hmean send-8192 8784.22 ( 0.00%) 12089.53 ( 37.63%)
> Hmean send-16384 15081.60 ( 0.00%) 19650.42 ( 30.29%)
> Hmean recv-64 86.19 ( 0.00%) 128.43 ( 49.01%)
> Hmean recv-128 173.93 ( 0.00%) 257.28 ( 47.92%)
> Hmean recv-256 346.19 ( 0.00%) 508.40 ( 46.86%)
> Hmean recv-1024 1358.28 ( 0.00%) 1979.21 ( 45.71%)
> Hmean recv-2048 2623.45 ( 0.00%) 3747.30 ( 42.84%)
> Hmean recv-3312 4108.63 ( 0.00%) 5873.42 ( 42.95%)
> Hmean recv-4096 5037.25 ( 0.00%) 7101.54 ( 40.98%)
> Hmean recv-8192 8762.32 ( 0.00%) 12009.88 ( 37.06%)
> Hmean recv-16384 15042.36 ( 0.00%) 19513.27 ( 29.72%)
>
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was almost
> no difference but pcc-cpufreq is also a known performance hazard.
>
> These are quite different results but illustrate that the patch is
> dependent on the CPU. The results are similar for TCP_STREAM on
> the two-socket machine.
>
> The observations on sockperf are different.
>
> 2-socket modern machine
> sockperf-tcp-throughput
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v4
> Hmean 14 93.90 ( 0.00%) 92.74 ( -1.23%)
> Hmean 100 1211.02 ( 0.00%) 1284.36 ( 6.05%)
> Hmean 300 6016.95 ( 0.00%) 6149.26 ( 2.20%)
> Hmean 500 8846.20 ( 0.00%) 8988.84 ( 1.61%)
> Hmean 850 12280.71 ( 0.00%) 12434.78 ( 1.25%)
> Stddev 14 5.32 ( 0.00%) 4.79 ( 9.88%)
> Stddev 100 35.32 ( 0.00%) 74.20 (-110.06%)
> Stddev 300 132.63 ( 0.00%) 65.50 ( 50.61%)
> Stddev 500 152.90 ( 0.00%) 188.67 (-23.40%)
> Stddev 850 221.46 ( 0.00%) 257.61 (-16.32%)
>
> sockperf-udp-throughput
> 4.9.0-rc5 4.9.0-rc5
> vanilla hopcpu-v4
> Hmean 14 36.32 ( 0.00%) 51.25 ( 41.09%)
> Hmean 100 258.41 ( 0.00%) 355.76 ( 37.67%)
> Hmean 300 773.96 ( 0.00%) 1054.13 ( 36.20%)
> Hmean 500 1291.07 ( 0.00%) 1758.21 ( 36.18%)
> Hmean 850 2137.88 ( 0.00%) 2939.52 ( 37.50%)
> Stddev 14 0.75 ( 0.00%) 1.21 (-61.36%)
> Stddev 100 9.02 ( 0.00%) 11.53 (-27.89%)
> Stddev 300 13.66 ( 0.00%) 31.24 (-128.62%)
> Stddev 500 25.01 ( 0.00%) 53.44 (-113.67%)
> Stddev 850 37.72 ( 0.00%) 70.05 (-85.71%)
>
> Note that the improvements for TCP are nowhere near as dramatic as with
> netperf; there is a slight loss for small packets and the results are much
> more variable. While it's not presented here, it's known that when running
> sockperf "under load", packet latency is generally lower but not universally
> so. On the other hand, UDP improves performance but, again, is much more
> variable.
>
> This highlights that the patch is not necessarily a universal win and is
> going to depend heavily on both the workload and the CPU used.
>
> hackbench was also tested with both sockets and pipes and with both
> processes and threads, and the results are interesting in terms of how
> variability is impacted.
>
> 1-socket machine
> hackbench-process-pipes
> 4.9.0-rc5 4.9.0-rc5
> vanilla highmark-v3
> Amean 1 12.9637 ( 0.00%) 13.1807 ( -1.67%)
> Amean 3 13.4770 ( 0.00%) 13.6803 ( -1.51%)
> Amean 5 18.5333 ( 0.00%) 18.7383 ( -1.11%)
> Amean 7 24.5690 ( 0.00%) 23.0550 ( 6.16%)
> Amean 12 39.7990 ( 0.00%) 36.7207 ( 7.73%)
> Amean 16 56.0520 ( 0.00%) 48.2890 ( 13.85%)
> Stddev 1 0.3847 ( 0.00%) 0.5853 (-52.15%)
> Stddev 3 0.2652 ( 0.00%) 0.0295 ( 88.89%)
> Stddev 5 0.5589 ( 0.00%) 0.2466 ( 55.87%)
> Stddev 7 0.5310 ( 0.00%) 0.6680 (-25.79%)
> Stddev 12 1.0780 ( 0.00%) 0.3230 ( 70.04%)
> Stddev 16 2.1138 ( 0.00%) 0.6835 ( 67.66%)
>
> hackbench-process-sockets
> Amean 1 4.8873 ( 0.00%) 4.7180 ( 3.46%)
> Amean 3 14.1157 ( 0.00%) 14.3643 ( -1.76%)
> Amean 5 22.5537 ( 0.00%) 23.1380 ( -2.59%)
> Amean 7 30.3743 ( 0.00%) 31.1520 ( -2.56%)
> Amean 12 49.1773 ( 0.00%) 50.3060 ( -2.30%)
> Amean 16 64.0873 ( 0.00%) 66.2633 ( -3.40%)
> Stddev 1 0.2360 ( 0.00%) 0.2201 ( 6.74%)
> Stddev 3 0.0539 ( 0.00%) 0.0780 (-44.72%)
> Stddev 5 0.1463 ( 0.00%) 0.1579 ( -7.90%)
> Stddev 7 0.1260 ( 0.00%) 0.3091 (-145.31%)
> Stddev 12 0.2169 ( 0.00%) 0.4822 (-122.36%)
> Stddev 16 0.0529 ( 0.00%) 0.4513 (-753.20%)
>
> It's not a universal win for pipes but the differences are within the
> noise. What is interesting is that variability shows both gains and losses,
> in stark contrast to the sockperf results. On the other hand, sockets
> generally show small losses, albeit within the noise, with more variability.
> Once again, the workload and the CPU give different results.
>
> fsmark was tested with zero-sized files to continually allocate slab objects
> but didn't show any differences. This can be explained by the fact that the
> workload is only allocating and does not have a mix of allocs/frees that would
> benefit from the caching. It was tested to ensure no major harm was done.
>
> While it is recognised that this is a mixed bag of results, the patch
> helps a lot more workloads than it hurts and, intuitively, avoiding the
> zone->lock in some cases is a good thing.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Acked-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>

Acked-by: Michal Hocko <mhocko@xxxxxxxx>

> ---
> include/linux/mmzone.h | 20 ++++++++-
> mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++------------------
> 2 files changed, 93 insertions(+), 44 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0f088f3a2fed..54032ab2f4f9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -255,6 +255,24 @@ enum zone_watermarks {
> NR_WMARK
> };
>
> +/*
> + * One per migratetype for order-0 pages and one per high-order up to
> + * and including PAGE_ALLOC_COSTLY_ORDER. This may allow unmovable
> + * allocations to contaminate reclaimable pageblocks if high-order
> + * pages are heavily used.
> + */
> +#define NR_PCP_LISTS (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)
> +
> +static inline unsigned int pindex_to_order(unsigned int pindex)
> +{
> + return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
> +}
> +
> +static inline unsigned int order_to_pindex(int migratetype, unsigned int order)
> +{
> + return (order == 0) ? migratetype : MIGRATE_PCPTYPES + order - 1;
> +}
> +
> #define min_wmark_pages(z) (z->watermark[WMARK_MIN])
> #define low_wmark_pages(z) (z->watermark[WMARK_LOW])
> #define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
> @@ -265,7 +283,7 @@ struct per_cpu_pages {
> int batch; /* chunk size for buddy add/remove */
>
> /* Lists of pages, one per migrate type stored on the pcp-lists */
> - struct list_head lists[MIGRATE_PCPTYPES];
> + struct list_head lists[NR_PCP_LISTS];
> };
>
> struct per_cpu_pageset {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6de9440e3ae2..94808f565f74 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1050,9 +1050,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
> }
>
> #ifdef CONFIG_DEBUG_VM
> -static inline bool free_pcp_prepare(struct page *page)
> +static inline bool free_pcp_prepare(struct page *page, unsigned int order)
> {
> - return free_pages_prepare(page, 0, true);
> + return free_pages_prepare(page, order, true);
> }
>
> static inline bool bulkfree_pcp_prepare(struct page *page)
> @@ -1060,9 +1060,9 @@ static inline bool bulkfree_pcp_prepare(struct page *page)
> return false;
> }
> #else
> -static bool free_pcp_prepare(struct page *page)
> +static bool free_pcp_prepare(struct page *page, unsigned int order)
> {
> - return free_pages_prepare(page, 0, false);
> + return free_pages_prepare(page, order, false);
> }
>
> static bool bulkfree_pcp_prepare(struct page *page)
> @@ -1085,8 +1085,9 @@ static bool bulkfree_pcp_prepare(struct page *page)
> static void free_pcppages_bulk(struct zone *zone, int count,
> struct per_cpu_pages *pcp)
> {
> - int migratetype = 0;
> - int batch_free = 0;
> + unsigned int pindex = UINT_MAX; /* Reclaim will start at 0 */
> + unsigned int batch_free = 0;
> + unsigned int nr_freed = 0;
> unsigned long nr_scanned;
> bool isolated_pageblocks;
>
> @@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> if (nr_scanned)
> __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
> - while (count) {
> + while (count > 0) {
> struct page *page;
> struct list_head *list;
> + unsigned int order;
>
> /*
> * Remove pages from lists in a round-robin fashion. A
> * batch_free count is maintained that is incremented when an
> - * empty list is encountered. This is so more pages are freed
> - * off fuller lists instead of spinning excessively around empty
> - * lists
> + * empty list is encountered. This is not exact due to
> + * high-order but precision is not required.
> */
> do {
> batch_free++;
> - if (++migratetype == MIGRATE_PCPTYPES)
> - migratetype = 0;
> - list = &pcp->lists[migratetype];
> + if (++pindex == NR_PCP_LISTS)
> + pindex = 0;
> + list = &pcp->lists[pindex];
> } while (list_empty(list));
>
> /* This is the only non-empty list. Free them all. */
> - if (batch_free == MIGRATE_PCPTYPES)
> + if (batch_free == NR_PCP_LISTS)
> batch_free = count;
>
> + order = pindex_to_order(pindex);
> do {
> int mt; /* migratetype of the to-be-freed page */
>
> @@ -1135,11 +1137,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> if (bulkfree_pcp_prepare(page))
> continue;
>
> - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> - trace_mm_page_pcpu_drain(page, 0, mt);
> - } while (--count && --batch_free && !list_empty(list));
> + __free_one_page(page, page_to_pfn(page), zone, order, mt);
> + trace_mm_page_pcpu_drain(page, order, mt);
> + nr_freed += (1 << order);
> + count -= (1 << order);
> + } while (count > 0 && --batch_free && !list_empty(list));
> }
> spin_unlock(&zone->lock);
> + pcp->count -= nr_freed;
> }
>
> static void free_one_page(struct zone *zone,
> @@ -2243,10 +2248,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
> local_irq_save(flags);
> batch = READ_ONCE(pcp->batch);
> to_drain = min(pcp->count, batch);
> - if (to_drain > 0) {
> + if (to_drain > 0)
> free_pcppages_bulk(zone, to_drain, pcp);
> - pcp->count -= to_drain;
> - }
> local_irq_restore(flags);
> }
> #endif
> @@ -2268,10 +2271,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
> pset = per_cpu_ptr(zone->pageset, cpu);
>
> pcp = &pset->pcp;
> - if (pcp->count) {
> + if (pcp->count)
> free_pcppages_bulk(zone, pcp->count, pcp);
> - pcp->count = 0;
> - }
> local_irq_restore(flags);
> }
>
> @@ -2403,18 +2404,18 @@ void mark_free_pages(struct zone *zone)
> #endif /* CONFIG_PM */
>
> /*
> - * Free a 0-order page
> + * Free a pcp page
> * cold == true ? free a cold page : free a hot page
> */
> -void free_hot_cold_page(struct page *page, bool cold)
> +static void __free_hot_cold_page(struct page *page, bool cold, unsigned int order)
> {
> struct zone *zone = page_zone(page);
> struct per_cpu_pages *pcp;
> unsigned long flags;
> unsigned long pfn = page_to_pfn(page);
> - int migratetype;
> + int migratetype, pindex;
>
> - if (!free_pcp_prepare(page))
> + if (!free_pcp_prepare(page, order))
> return;
>
> migratetype = get_pfnblock_migratetype(page, pfn);
> @@ -2431,28 +2432,33 @@ void free_hot_cold_page(struct page *page, bool cold)
> */
> if (migratetype >= MIGRATE_PCPTYPES) {
> if (unlikely(is_migrate_isolate(migratetype))) {
> - free_one_page(zone, page, pfn, 0, migratetype);
> + free_one_page(zone, page, pfn, order, migratetype);
> goto out;
> }
> migratetype = MIGRATE_MOVABLE;
> }
>
> + pindex = order_to_pindex(migratetype, order);
> pcp = &this_cpu_ptr(zone->pageset)->pcp;
> if (!cold)
> - list_add(&page->lru, &pcp->lists[migratetype]);
> + list_add(&page->lru, &pcp->lists[pindex]);
> else
> - list_add_tail(&page->lru, &pcp->lists[migratetype]);
> - pcp->count++;
> + list_add_tail(&page->lru, &pcp->lists[pindex]);
> + pcp->count += 1 << order;
> if (pcp->count >= pcp->high) {
> unsigned long batch = READ_ONCE(pcp->batch);
> free_pcppages_bulk(zone, batch, pcp);
> - pcp->count -= batch;
> }
>
> out:
> local_irq_restore(flags);
> }
>
> +void free_hot_cold_page(struct page *page, bool cold)
> +{
> + __free_hot_cold_page(page, cold, 0);
> +}
> +
> /*
> * Free a list of 0-order pages
> */
> @@ -2588,20 +2594,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> struct page *page;
> bool cold = ((gfp_flags & __GFP_COLD) != 0);
>
> - if (likely(order == 0)) {
> + if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> struct per_cpu_pages *pcp;
> struct list_head *list;
>
> local_irq_save(flags);
> do {
> + unsigned int pindex;
> +
> + pindex = order_to_pindex(migratetype, order);
> pcp = &this_cpu_ptr(zone->pageset)->pcp;
> - list = &pcp->lists[migratetype];
> + list = &pcp->lists[pindex];
> if (list_empty(list)) {
> - pcp->count += rmqueue_bulk(zone, 0,
> + int nr_pages = rmqueue_bulk(zone, order,
> pcp->batch, list,
> migratetype, cold);
> - if (unlikely(list_empty(list)))
> + if (unlikely(list_empty(list))) {
> + /*
> + * Retry high-order atomic allocs
> + * from the buddy list which may
> + * use MIGRATE_HIGHATOMIC.
> + */
> + if (order && (alloc_flags & ALLOC_HARDER))
> + goto try_buddylist;
> +
> goto failed;
> + }
> + pcp->count += (nr_pages << order);
> }
>
> if (cold)
> @@ -2610,10 +2629,11 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> page = list_first_entry(list, struct page, lru);
>
> list_del(&page->lru);
> - pcp->count--;
> + pcp->count -= (1 << order);
>
> } while (check_new_pcp(page));
> } else {
> +try_buddylist:
> /*
> * We most definitely don't want callers attempting to
> * allocate greater than order-1 page units with __GFP_NOFAIL.
> @@ -3837,8 +3857,8 @@ EXPORT_SYMBOL(get_zeroed_page);
> void __free_pages(struct page *page, unsigned int order)
> {
> if (put_page_testzero(page)) {
> - if (order == 0)
> - free_hot_cold_page(page, false);
> + if (order <= PAGE_ALLOC_COSTLY_ORDER)
> + __free_hot_cold_page(page, false, order);
> else
> __free_pages_ok(page, order);
> }
> @@ -5160,20 +5180,31 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
> /* a companion to pageset_set_high() */
> static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
> {
> - pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
> + unsigned long high;
> +
> + /*
> + * per-cpu refills occur when a per-cpu list for a migratetype
> + * or a high-order is depleted even if pages are free overall.
> + * Tune the high watermark such that it's unlikely, but not
> + * impossible, that a single refill event will trigger a
> + * shrink on the next free to the per-cpu list.
> + */
> + high = batch * MIGRATE_PCPTYPES + (batch << PAGE_ALLOC_COSTLY_ORDER);
> +
> + pageset_update(&p->pcp, high, max(1UL, 1 * batch));
> }
>
> static void pageset_init(struct per_cpu_pageset *p)
> {
> struct per_cpu_pages *pcp;
> - int migratetype;
> + unsigned int pindex;
>
> memset(p, 0, sizeof(*p));
>
> pcp = &p->pcp;
> pcp->count = 0;
> - for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
> - INIT_LIST_HEAD(&pcp->lists[migratetype]);
> + for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
> + INIT_LIST_HEAD(&pcp->lists[pindex]);
> }
>
> static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
> --
> 2.10.2
>

--
Michal Hocko
SUSE Labs