[PATCH -V2 10/10] mm, pcp: reduce detecting time of consecutive high order page freeing

From: Huang Ying
Date: Tue Sep 26 2023 - 02:10:58 EST

Next message: Arnd Bergmann: "Re: [PATCH v1 6/7] DCE/DSE: riscv: add HAVE_TRIM_UNUSED_SYSCALLS support"
Previous message: Huang Ying: "[PATCH -V2 09/10] mm, pcp: avoid to reduce PCP high unnecessarily"
In reply to: Huang Ying: "[PATCH -V2 09/10] mm, pcp: avoid to reduce PCP high unnecessarily"
Next in thread: Huang Ying: "[PATCH -V2 02/10] cacheinfo: calculate per-CPU data cache size"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

In current PCP auto-tuning design, if the number of pages allocated is
much more than that of pages freed on a CPU, the PCP high may become
the maximal value even if the allocating/freeing depth is small, for
example, in the sender of network workloads. If a CPU was used as
sender originally, then it is used as receiver after context
switching, we need to fill the whole PCP with maximal high before
triggering PCP draining for consecutive high order freeing. This will
hurt the performance of some network workloads.

To solve the issue, in this patch, we will track the consecutive page
freeing with a counter in stead of relying on PCP draining. So, we
can detect consecutive page freeing much earlier.

On a 2-socket Intel server with 128 logical CPU, we tested
SCTP_STREAM_MANY test case of netperf test suite with 64-pair
processes. With the patch, the network bandwidth improves 3.1%. This
restores the performance drop caused by PCP auto-tuning.

Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: Johannes Weiner <jweiner@xxxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Pavel Tatashin <pasha.tatashin@xxxxxxxxxx>
Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Cc: Christoph Lameter <cl@xxxxxxxxx>
---
include/linux/mmzone.h | 2 +-
mm/page_alloc.c | 23 +++++++++++------------
2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 35b78c7522a7..44f6dc3cdeeb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -689,10 +689,10 @@ struct per_cpu_pages {
int batch; /* chunk size for buddy add/remove */
u8 flags; /* protected by pcp->lock */
u8 alloc_factor; /* batch scaling factor during allocate */
- u8 free_factor; /* batch scaling factor during free */
#ifdef CONFIG_NUMA
u8 expire; /* When 0, remote pagesets are drained */
#endif
+ short free_count; /* consecutive free count */

/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[NR_PCP_LISTS];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d7b602822ab3..206ab768ec23 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2375,13 +2375,10 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
max_nr_free = high - batch;

/*
- * Double the number of pages freed each time there is subsequent
- * freeing of pages without any allocation.
+ * Increase the batch number to the number of the consecutive
+ * freed pages to reduce zone lock contention.
*/
- batch <<= pcp->free_factor;
- if (batch <= max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
- pcp->free_factor++;
- batch = clamp(batch, min_nr_free, max_nr_free);
+ batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);

return batch;
}
@@ -2408,7 +2405,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
* stored on pcp lists
*/
if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
- pcp->high = max(high - (batch << pcp->free_factor), high_min);
+ pcp->high = max(high - pcp->free_count, high_min);
return min(batch << 2, pcp->high);
}

@@ -2416,10 +2413,10 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
return high;

if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
- pcp->high = max(high - (batch << pcp->free_factor), high_min);
+ pcp->high = max(high - pcp->free_count, high_min);
high = max(pcp->count, high_min);
} else if (pcp->count >= high) {
- int need_high = (batch << pcp->free_factor) + batch;
+ int need_high = pcp->free_count + batch;

/* pcp->high should be large enough to hold batch freed pages */
if (pcp->high < need_high)
@@ -2456,7 +2453,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
* stops will be drained from vmstat refresh context.
*/
if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
- free_high = (pcp->free_factor &&
+ free_high = (pcp->free_count >= batch &&
(pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
(!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
pcp->count >= READ_ONCE(batch)));
@@ -2464,6 +2461,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
}
+ if (pcp->free_count < (batch << PCP_BATCH_SCALE_MAX))
+ pcp->free_count += (1 << order);
high = nr_pcp_high(pcp, zone, batch, free_high);
if (pcp->count >= high) {
free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2861,7 +2860,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
* See nr_pcp_free() where free_factor is increased for subsequent
* frees.
*/
- pcp->free_factor >>= 1;
+ pcp->free_count >>= 1;
list = &pcp->lists[order_to_pindex(migratetype, order)];
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
pcp_spin_unlock(pcp);
@@ -5483,7 +5482,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
pcp->high_min = BOOT_PAGESET_HIGH;
pcp->high_max = BOOT_PAGESET_HIGH;
pcp->batch = BOOT_PAGESET_BATCH;
- pcp->free_factor = 0;
+ pcp->free_count = 0;
}

static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,
--
2.39.2

Next message: Arnd Bergmann: "Re: [PATCH v1 6/7] DCE/DSE: riscv: add HAVE_TRIM_UNUSED_SYSCALLS support"
Previous message: Huang Ying: "[PATCH -V2 09/10] mm, pcp: avoid to reduce PCP high unnecessarily"
In reply to: Huang Ying: "[PATCH -V2 09/10] mm, pcp: avoid to reduce PCP high unnecessarily"
Next in thread: Huang Ying: "[PATCH -V2 02/10] cacheinfo: calculate per-CPU data cache size"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]