[RFC PATCH 0/2] Hot page promotion optimization for large address space

From: Bharata B Rao
Date: Wed Mar 27 2024 - 12:03:14 EST


In order to check how efficiently the existing NUMA balancing
based hot page promotion mechanism can detect hot regions and
promote pages for workloads with large memory footprints, I
wrote and tested a program that allocates a huge amount of
memory but routinely touches only small parts of it.

This microbenchmark provisions memory on both the DRAM node and the CXL
node. It then divides the entire allocated memory into smaller chunks
and randomly chooses a chunk for generating memory accesses.
Each chunk is then accessed for a fixed number of iterations to
create the notion of hotness. Within each chunk, the individual
pages at 4K granularity are again accessed in random fashion.
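
The access pattern described above can be sketched roughly as follows.
This is not the actual microbenchmark source (which is not included in
this posting); the function names, the xorshift PRNG and the write-based
"touch" are all assumptions made for illustration, and the node
placement of the buffer (DRAM vs CXL) is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K 4096UL

/* Small xorshift64 PRNG so the sketch has no external dependencies. */
static uint64_t next_random(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/*
 * Touch `iterations` randomly chosen 4K pages inside one chunk.
 * Returns the number of accesses performed (0 if the chunk is
 * smaller than one page).
 */
static size_t touch_chunk(volatile unsigned char *chunk, size_t chunk_size,
			  size_t iterations, uint64_t *rng)
{
	size_t pages = chunk_size / PAGE_SIZE_4K;
	size_t i;

	if (pages == 0)
		return 0;

	for (i = 0; i < iterations; i++) {
		size_t page = next_random(rng) % pages;

		chunk[page * PAGE_SIZE_4K]++;	/* one access per picked page */
	}
	return iterations;
}

/*
 * Hypothetical driver: pick a random chunk `rounds` times and make it
 * "hot" by accessing it for a fixed number of iterations.
 */
static size_t run_access_rounds(unsigned char *buf, size_t total_size,
				size_t chunk_size, size_t rounds,
				size_t iterations, uint64_t seed)
{
	uint64_t rng = seed ? seed : 1;	/* xorshift state must be non-zero */
	size_t chunks, touched = 0, r;

	if (chunk_size == 0 || total_size < chunk_size)
		return 0;
	chunks = total_size / chunk_size;

	for (r = 0; r < rounds; r++) {
		size_t c = next_random(&rng) % chunks;

		touched += touch_chunk(buf + c * chunk_size, chunk_size,
				       iterations, &rng);
	}
	return touched;
}
```

In the real experiment the buffer would be 192G with 1G chunks; the
sketch only shows the two levels of random selection (chunk, then page
within the chunk).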

When a chunk is taken up for access in this manner, its pages
can either be residing on DRAM or CXL. In the latter case, the NUMA
balancing driven hot page promotion logic is expected to detect and
promote the hot pages that reside on CXL.

The experiment was conducted on a 2P AMD Bergamo system that has
CXL as the 3rd node.

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-127,256-383
node 0 size: 128054 MB
node 1 cpus: 128-255,384-511
node 1 size: 128880 MB
node 2 cpus:
node 2 size: 129024 MB
node distances:
node   0   1   2
  0:  10  32  60
  1:  32  10  50
  2: 255 255  10

The number of pages that get promoted turns out to be very low. The
reason is that the NUMA hint fault latency is much higher than the hot
threshold most of the time. Here are a few latency and threshold sample
values captured from the should_numa_migrate_memory() routine while the
benchmark was running:

latency		threshold (in ms)
20620		1125
56185		1125
98710		1250
148871		1375
182891		1625
369415		1875
630745		2000

The NUMA hint fault latency metric, which is based on the absolute time
difference between scanning time and fault time, may not be suitable
for applications that have large amounts of memory. When the time
between the scan-time PTE update and the subsequent access (hint fault)
is long, the existing logic in should_numa_migrate_memory() that
decides whether a page needs to be migrated rejects more pages than it
selects for promotion.
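
As a simplified userspace model of the check described above (not the
kernel code itself; the function name and the millisecond parameters
are assumptions for illustration), the time based decision amounts to:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of a time based hotness check: a page counts as hot only if
 * the access (hint fault) happened soon enough after the scan marked
 * its PTE. Unsigned subtraction keeps the result sane on wraparound.
 */
static bool is_hot_by_time(uint32_t scan_time_ms, uint32_t fault_time_ms,
			   uint32_t threshold_ms)
{
	uint32_t latency_ms = fault_time_ms - scan_time_ms;

	return latency_ms < threshold_ms;
}
```

With the first sample above (latency 20620, threshold 1125), this model
rejects the page, which is the common case for this workload.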

To address this problem, this RFC converts the absolute time based
hint fault latency into a relative metric. The number of hint faults
that have occurred between the scan time and the page's fault time
is used as the latency metric in place of elapsed time.
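
The idea can be modelled in userspace roughly as below. This is only a
sketch under assumptions, not the code in the patches: the struct, the
function names and the single global counter are hypothetical, and
where the real counter lives (per process, per node, or elsewhere) is
defined by the patches themselves.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-page state: fault counter value seen at scan time. */
struct page_model {
	uint64_t faults_at_scan;
};

/* Hypothetical running count of hint faults (global in this model). */
static uint64_t hint_fault_count;

/* Scan time: remember how many hint faults had occurred so far. */
static void model_scan(struct page_model *page)
{
	page->faults_at_scan = hint_fault_count;
}

/*
 * Fault time: the "latency" is the number of hint faults that occurred
 * since this page was scanned, independent of wall-clock time.
 * Returns true if the page would be treated as hot.
 */
static bool model_fault(struct page_model *page, uint64_t threshold)
{
	uint64_t elapsed_faults = hint_fault_count - page->faults_at_scan;

	hint_fault_count++;
	return elapsed_faults < threshold;
}
```

In this model a page that is accessed late in wall-clock terms can
still qualify as hot, as long as few other hint faults happened in
between.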

This is quite experimental work and there are still things to take
care of. While more testing needs to be conducted with different
benchmarks, I am posting the patchset here to get early feedback.

Microbenchmark
==============
Total allocation is 192G, which initially fills all of Node 1 (DRAM)
and half of Node 2 (CXL).
Chunk size is 1G.

				Default		Patched

Benchmark score (us)		637,787,351	571,350,410 (-10.41%)
(Lower is better)

numa_pte_updates		29,834,747	29,275,489
numa_hint_faults		12,512,736	12,080,772
numa_hint_faults_local		0		0
numa_pages_migrated		1,804,583	6,709,580
pgpromote_success		1,804,500	6,709,526
pgpromote_candidate		1,916,720	7,523,345
pgdemote_kswapd			5,358,119	9,438,006
pgdemote_direct			0		0

				Default		Patched
Number of times
should_numa_migrate_memory()
was invoked:			12,512,736	12,080,772

Number of times the migration
request was rejected due to
hint fault latency being
higher than threshold:		10,595,933	4,557,401

Redis-memtier
=============
memtier_benchmark -t 512 -n 25000 --ratio 1:1 -c 20 -x 1 --key-pattern R:R \
	--hide-histogram --distinct-client-seed -d 20000 --pipeline=1000

				Default		Patched

Ops/sec				51,921.16	52,694.55
Hits/sec			21,908.72	22,235.03
Misses/sec			4,051.86	4,112.24
Avg. Latency			867.51710	591.27561 (-31.84%)
p50 Latency			876.54300	708.60700 (-19.15%)
p99 Latency			1044.47900	1044.47900
p99.9 Latency			1048.57500	1048.57500
KB/sec				937,330.19	951,291.76

numa_pte_updates		66,628,064	72,125,512
numa_hint_faults		57,093,369	63,369,538
numa_hint_faults_local		0		0
numa_pages_migrated		799,128		3,634,114
pgpromote_success		798,974		3,633,672
pgpromote_candidate		33,884,196	23,143,552
pgdemote_kswapd			13,321,784	11,948,894
pgdemote_direct			257		57,147

Bharata B Rao (2):
  sched/numa: Fault count based NUMA hint fault latency
  mm: Update hint fault count for pages that are skipped during scanning

 include/linux/mm.h       | 23 ++++---------
 include/linux/mm_types.h |  3 ++
 kernel/sched/debug.c     |  2 +-
 kernel/sched/fair.c      | 73 +++++++++++-----------------------------
 kernel/sched/sched.h     |  1 +
 mm/huge_memory.c         | 10 +++---
 mm/memory.c              |  2 ++
 mm/mprotect.c            | 14 ++++----
 8 files changed, 46 insertions(+), 82 deletions(-)

--
2.25.1