[PATCH v2] mm, vmscan: don't turn on cache_trim_mode at high scan priorities

From: Byungchul Park
Date: Thu Feb 22 2024 - 00:50:17 EST


With cache_trim_mode on, the reclaim logic does not bother reclaiming anon
pages. The mode should therefore be turned on more carefully, because it
prevents anon pages from being reclaimed even when there are a huge number
of cold anon pages that should be reclaimed. Even worse, it can cause
kswapd_failures to reach MAX_RECLAIM_RETRIES, which stops kswapd until
direct reclaim eventually succeeds and resumes it. So this is more of a
bug fix than a performance improvement.
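
As a rough illustration, the following standalone sketch compares the
condition that turns the mode on before and after this patch. It is not
kernel code: the helper names and the plain-integer arguments are made up
for illustration, and the deactivate_file flag stands in for
'sc->may_deactivate & DEACTIVATE_FILE'.

```c
#include <assert.h>
#include <stdbool.h>

/* Pre-patch: enough inactive file pages for this priority, no file deactivation. */
static inline bool trim_mode_before(unsigned long file, int priority,
				    bool deactivate_file)
{
	/* Guard the shift; kernel scan priorities are small non-negative ints. */
	assert(priority >= 0 && priority < 32);
	return (file >> priority) && !deactivate_file;
}

/* Post-patch: same test, but never at priority 1 or below. */
static inline bool trim_mode_after(unsigned long file, int priority,
				   bool deactivate_file)
{
	assert(priority >= 0 && priority < 32);
	return priority > 1 && (file >> priority) && !deactivate_file;
}
```

For example, with a hypothetical 30000 inactive file pages and
deactivate_file unset, the pre-patch condition holds at every priority from
12 (DEF_PRIORITY) down to 1, while the post-patch condition stops holding
at priority 1, so reclaim is no longer restricted to file pages there.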

The problematic behavior can be reproduced by:

CONFIG_NUMA_BALANCING enabled
sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING

numa node0 (8GB local memory, 16 CPUs)
numa node1 (8GB slow tier memory, no CPUs)

Sequence:

1) echo 3 > /proc/sys/vm/drop_caches
2) To emulate a system whose local DRAM is full of cold memory, run
the following dummy program and never touch the mapped region:

mmap(0, 8UL * 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);

3) Run any memory-intensive workload, e.g. XSBench.
4) Check whether NUMA balancing is working, i.e. promotion/demotion.
5) Repeat 1) ~ 4) until kswapd stops.
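
The dummy program in step 2) could look roughly like the sketch below. The
helper names are hypothetical; only the mmap() flags come from the snippet
above. The length is passed as a size_t so that the 8GB size does not
overflow int arithmetic.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map and populate 'len' bytes of anonymous memory. MAP_POPULATE faults
 * the pages in up front; nothing touches them afterwards, so they go cold.
 * Returns NULL on failure.
 */
static inline void *map_cold_region(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Keep the mapping alive until the process is killed. */
static inline int hold_cold_region(size_t len)
{
	if (!map_cold_region(len))
		return -1;
	for (;;)
		pause();
}
```

Calling hold_cold_region((size_t)8 << 30) from main() would correspond to
the 8GB region used in this setup.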

With this, you will eventually see that promotion/demotion stop working
because kswapd has stopped due to ->kswapd_failures >=
MAX_RECLAIM_RETRIES.

The notable differences in vmstat deltas before and after this patch are:

-nr_inactive_anon 321935
-nr_active_anon 1780700
-nr_inactive_file 30425
-nr_active_file 14961
-pgpromote_success 356
-pgpromote_candidate 21953245
-pgactivate 1844523
-pgdeactivate 50634
-pgfault 31100294
-pgdemote_kswapd 30856
-pgscan_kswapd 1861981
-pgscan_anon 1822930
-pgscan_file 39051
-pgsteal_anon 386
-pgsteal_file 30470
-pageoutrun 30
-numa_hint_faults 27418279
-numa_pages_migrated 356

+nr_inactive_anon 1662306
+nr_active_anon 440303
+nr_inactive_file 27669
+nr_active_file 1654
+pgpromote_success 1314102
+pgpromote_candidate 1892525
+pgactivate 3284457
+pgdeactivate 1527504
+pgfault 6847775
+pgdemote_kswapd 2142047
+pgscan_kswapd 7496588
+pgscan_anon 7462488
+pgscan_file 34100
+pgsteal_anon 2115661
+pgsteal_file 26386
+pageoutrun 378
+numa_hint_faults 3220891
+numa_pages_migrated 1314102

where -: before this patch, +: after this patch
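
One way to read these numbers is the share of scanned anon pages that were
actually reclaimed, i.e. pgsteal_anon / pgscan_anon. The sketch below only
redoes that arithmetic on the counters listed above; the helper names are
made up for illustration.

```c
#include <stdio.h>

/* Percentage of scanned pages that were reclaimed; 0 if nothing was scanned. */
static inline double steal_pct(unsigned long steal, unsigned long scan)
{
	return scan ? 100.0 * (double)steal / (double)scan : 0.0;
}

/* Print the anon ratio for the "before" and "after" runs listed above. */
static inline void print_anon_steal_pct(void)
{
	printf("before: %.3f%%\n", steal_pct(386, 1822930));
	printf("after:  %.3f%%\n", steal_pct(2115661, 7462488));
}
```

With the counters above this comes to roughly 0.02% before the patch and
about 28% after it.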

Signed-off-by: Byungchul Park <byungchul@xxxxxx>
---
mm/vmscan.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bba207f41b14..6eda59fce5ee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2266,9 +2266,17 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
* If we have plenty of inactive file pages that aren't
* thrashing, try to reclaim those first before touching
* anonymous pages.
+ *
+ * However, keeping 'sc->cache_trim_mode == 1' all through
+ * the scan priorities might lead to reclaim failure. If that
+ * happens MAX_RECLAIM_RETRIES times, then kswapd would get stopped
+ * even if there are still plenty of anon pages to reclaim, which is
+ * not desirable. So do not use cache_trim_mode when reclaim is not
+ * smooth, i.e. at high scan priority.
*/
file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
- if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
+ if (sc->priority > 1 && file >> sc->priority &&
+ !(sc->may_deactivate & DEACTIVATE_FILE))
sc->cache_trim_mode = 1;
else
sc->cache_trim_mode = 0;
--
2.17.1