[RFC 0/2] How effective is numa_preferred_nid w.r.t. NUMA performance?

From: chris hyser
Date: Fri Dec 15 2023 - 19:19:30 EST


The commentary around the initial Oracle Soft Affinity proposal [1]
recommended investigating the use of numa_preferred_nid as a better
solution. The primary driver for the original proposal (then as now) is
better NUMA performance for important tasks accessing RDMA-pinned memory.
I wanted a fairly simple test to explore the various aspects of NUMA
performance, one that didn't require lots of time running TPC-C on a tuned
DB as Subhra had done. I needed something that would allow both task and
memory placement with usable NUMA sensitivity, and I think I stumbled onto
something quite useful. As the test is only concerned with the NUMA
effects of scheduler/balancer placement decisions (no locks, no
communication, no syscalls, etc. during the timed loop), it does not
represent any actual useful load, making it, I suppose, a NUMA
micro-benchmark.

A simplified description of the resulting benchmark: a probe process times
an outer loop that performs a specified "counts" worth of a tight inner
loop. In sequential mode the inner loop would access every u64 in a large
buffer; here it instead performs an equivalent number of accesses at
random (u64-aligned) indexes into the buffer, each a 64-bit read followed
by a 64-bit write (the code provides sequential vs random access as well
as various access patterns, but this is the combination most interesting
here). The probe's buffer memory is either allowed to float or is bound to
particular NUMA nodes, while the NUMA affinity of the process itself can
also be set (uses hard affinity), as can the prctl() from patch 2 to set a
"Preferred Node Affinity". (A minimal sketch of the probe is included
below.) The main difference between this and probably dozens of similar
programs is that the probe isn't the benchmark; it's just an extremely
NUMA-sensitive process. If you run it by itself on an idle system it will
park on a CPU, fill up the associated caches, and tell you absolutely
nothing.

What ultimately makes this interesting is running it in the presence of
load, specifically a constant percentage of CPU-only load replicated and
pinned on each CPU (a sketch of such a load generator is below). So, for
example, htop would show all but one CPU at say 60% (what I used in
generating the results here, but the "effect" occurs even with just a 1%
load), with that lone CPU running the probe and pegged at 100%. The result
is that the load balancer really feels the need to balance, and the NUMA
awareness of those placement decisions is clearly discernible in the
probe's measured times. In addition, the runtimes are sufficiently short
to enable tracing the entire life of the probe and categorizing all
migrations as 'same core', 'same node', and 'cross node'.
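
For reference, one way to generate such a load is a per-CPU duty-cycle
spinner; this sketch is my assumption of how it could be done (the 10 ms
period and the fork-per-CPU structure are mine), not necessarily how the
actual load generator works:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

/* busy-loop for 'pct' percent of each 10 ms period, pinned to 'cpu' */
static void spin(int cpu, int pct)
{
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);   /* pin to this CPU */

        const long period_ns = 10 * 1000 * 1000;   /* 10 ms period */
        const long busy_ns   = period_ns * pct / 100;
        struct timespec start, now;

        for (;;) {
                clock_gettime(CLOCK_MONOTONIC, &start);
                do {    /* burn CPU for the busy part of the period */
                        clock_gettime(CLOCK_MONOTONIC, &now);
                } while ((now.tv_sec - start.tv_sec) * 1000000000L +
                         (now.tv_nsec - start.tv_nsec) < busy_ns);

                /* sleep for the remainder of the period */
                struct timespec idle = { 0, period_ns - busy_ns };
                nanosleep(&idle, NULL);
        }
}

int main(int argc, char **argv)
{
        int pct   = (argc > 1) ? atoi(argv[1]) : 60;   /* e.g. 60% load */
        int ncpus = sysconf(_SC_NPROCESSORS_ONLN);

        for (int cpu = 0; cpu < ncpus; cpu++)
                if (fork() == 0)
                        spin(cpu, pct);     /* child spins until killed */
        for (int cpu = 0; cpu < ncpus; cpu++)
                wait(NULL);                 /* parent waits until killed */
        return 0;
}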

The above is a minimal description of the benchmark. I will be making it
available if people are interested (that, and once I get internal stuff
sorted), so after the holidays.

In terms of showing results, I also have test data for an AMD 8-node and an
ARM64 2-node box. I've also run tests exploring the benchmark over a range of
different migration_cost_ns values. Again, if people are interested, I have
data to share.

Test Results:
--------------
The below tests were run on an Intel(R) Xeon(R) Platinum 8167M CPU @
2.00GHz box. This has two memory nodes (each node spanned by a single LLC)
and 104 CPUs. The kernel was a recent tip:sched/core with the included
patches (POC only, just to show the changes).

Key:
-----------------
NB   - auto-numa-balancing (0 - off, 1 - on)
PNID - the prctl() "forced" numa_preferred_nid, i.e. 'Preferred Node
       Affinity' (given 2 nodes: 0, 1, and -1 for not set)
Mem  - the memory node when memory is bound, else 'F' for floating,
       i.e. not bound
CPU  - the node whose CPUs the probe is hard-affined to, else 'F' for
       floating, i.e. not affined
Avg  - the average time of the probe's measurements, in seconds

Each line below represents the average of 64 test runs with the indicated
parameters.

NumSamples: 64
Kernel: 6.7.0-rc1_ch_pna7_7+_#213 SMP PREEMPT_DYNAMIC Thu Dec 7 15:16:59 EST 2023
Load: 60
CPU_Model: IntelR XeonR Platinum 8167M CPU @ 2.00GHz
NUM_CPUS: 104
migration_cost_ns: 500000

Avg max min stdv | Test Parameters
----------------------------------------------------------------------
[00] 136.50 141.76 122.08 2.95 | PNID: -1 NB: 0 Mem: 0 CPU: 0
[01] 168.78 172.07 156.04 2.58 | PNID: -1 NB: 0 Mem: 0 CPU: 1
[02] 173.15 180.73 153.41 4.89 | PNID: -1 NB: 0 Mem: 0 CPU: F
[03] 165.95 169.17 162.13 1.57 | PNID: -1 NB: 0 Mem: 1 CPU: 0
[04] 137.23 144.28 123.75 4.97 | PNID: -1 NB: 0 Mem: 1 CPU: 1
[05] 179.90 187.21 165.90 3.73 | PNID: -1 NB: 0 Mem: 1 CPU: F
[06] 163.87 170.68 147.56 6.31 | PNID: -1 NB: 0 Mem: F CPU: 0
[07] 168.96 174.40 156.51 3.74 | PNID: -1 NB: 0 Mem: F CPU: 1
[08] 180.71 185.51 169.74 3.33 | PNID: -1 NB: 0 Mem: F CPU: F

[09] 135.68 139.28 119.92 2.93 | PNID: -1 NB: 1 Mem: 0 CPU: 0
[10] 166.60 169.82 160.05 1.76 | PNID: -1 NB: 1 Mem: 0 CPU: 1
[11] 171.97 181.91 163.94 3.70 | PNID: -1 NB: 1 Mem: 0 CPU: F
[12] 164.01 170.34 152.37 2.82 | PNID: -1 NB: 1 Mem: 1 CPU: 0
[13] 138.01 142.27 135.20 1.22 | PNID: -1 NB: 1 Mem: 1 CPU: 1
[14] 177.07 184.39 163.89 3.56 | PNID: -1 NB: 1 Mem: 1 CPU: F
[15] 165.70 171.33 154.46 2.41 | PNID: -1 NB: 1 Mem: F CPU: 0
[16] 165.18 170.83 149.12 5.99 | PNID: -1 NB: 1 Mem: F CPU: 1
[17] 148.91 163.04 134.31 5.48 | PNID: -1 NB: 1 Mem: F CPU: F

[18] 135.63 138.63 122.85 2.07 | PNID: 0 NB: 1 Mem: 0 CPU: 0
[19] 162.38 170.60 146.03 6.73 | PNID: 0 NB: 1 Mem: 0 CPU: 1
[20] 129.20 135.26 114.55 3.28 | PNID: 0 NB: 1 Mem: 0 CPU: F
[21] 161.71 168.72 144.87 5.55 | PNID: 0 NB: 1 Mem: 1 CPU: 0
[22] 135.72 140.44 123.34 3.10 | PNID: 0 NB: 1 Mem: 1 CPU: 1
[23] 155.07 162.20 138.71 4.50 | PNID: 0 NB: 1 Mem: 1 CPU: F
[24] 163.42 169.29 146.95 5.04 | PNID: 0 NB: 1 Mem: F CPU: 0
[25] 165.90 170.44 157.56 1.67 | PNID: 0 NB: 1 Mem: F CPU: 1
[26] 140.45 148.37 117.02 5.81 | PNID: 0 NB: 1 Mem: F CPU: F

[27] 135.26 140.78 123.29 2.30 | PNID: 1 NB: 1 Mem: 0 CPU: 0
[28] 166.22 169.51 148.18 2.65 | PNID: 1 NB: 1 Mem: 0 CPU: 1
[29] 157.91 165.94 153.48 2.75 | PNID: 1 NB: 1 Mem: 0 CPU: F
[30] 162.08 166.76 148.14 3.37 | PNID: 1 NB: 1 Mem: 1 CPU: 0
[31] 136.86 140.03 127.42 2.01 | PNID: 1 NB: 1 Mem: 1 CPU: 1
[32] 131.85 141.38 114.66 5.55 | PNID: 1 NB: 1 Mem: 1 CPU: F
[33] 163.64 169.48 149.35 2.74 | PNID: 1 NB: 1 Mem: F CPU: 0
[34] 165.94 170.47 156.10 2.41 | PNID: 1 NB: 1 Mem: F CPU: 1
[35] 145.28 154.64 137.84 3.60 | PNID: 1 NB: 1 Mem: F CPU: F

Observations:
---------------
First, we see the expected result that memory and CPU bound/pinned on the
same node {0,4,9,13,18,22,27,31} is quite a bit faster than when
bound/pinned on different nodes {1,3,10,12,19,21,28,30}. Completely
unexpected was that when binding memory to a node but allowing the CPU to
float (i.e., letting the scheduler "schedule" and the load balancer
"balance"), or when letting both float, the performance is as bad as or
worse than pinning CPUs and memory on different nodes {2,5,8,11,14}. NB
does help when both memory and the CPU float.

How is that possible? I did some traces of the probe with identical
params/kernel, etc. The migrations were then categorized as "same-core",
"same-node (minus same-core)", and "cross-node" (a sketch of the
classification is below).

Given this platform, a reasonable hypothesis is that cross-node migrations
are thrashing the LLC, and that is a big deal from a pure NUMA
perspective. Is there a general correlation between the number of
cross-node migrations and the longer completion times? The answer, I
believe, is yes. (The numbers below are representative samples rather than
averages, as there is still a manual step.)

When both memory and the CPUs are pinned (same node or different) we see
no cross-node migrations (the lone '1' comes from the probe starting on a
different node than the one it was later hard-affined to):

CPU: 0, Mem: 0, NB=0, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 846 num_migrations_samecore : 887
num_migrations_samenode : 2442 num_migrations_samenode : 2375
num_migrations_crossnode: 1 num_migrations_crossnode: 1
num_migrations: 3289 num_migrations: 3263

CPU: 1, Mem: 1, NB=0, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 822 num_migrations_samecore : 886
num_migrations_samenode : 2156 num_migrations_samenode : 1982
num_migrations_crossnode: 0 num_migrations_crossnode: 0
num_migrations: 2978 num_migrations: 2868

CPU: 0, Mem: 1, NB=0, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 1038 num_migrations_samecore : 1055
num_migrations_samenode : 2892 num_migrations_samenode : 2824
num_migrations_crossnode: 0 num_migrations_crossnode: 1
num_migrations: 3931 num_migrations: 3879


Compared to both CPU and memory allowed to float (as well as the impact of NB
and PNID):
CPU: F, Mem: F, NB=0, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 681 num_migrations_samecore : 800
num_migrations_samenode : 2306 num_migrations_samenode : 2255
num_migrations_crossnode: 1548 num_migrations_crossnode: 1503
num_migrations: 4535 num_migrations: 4558

CPU: F, Mem: F, NB=1, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 799 num_migrations_samecore : 646
num_migrations_samenode : 3098 num_migrations_samenode : 2775
num_migrations_crossnode: 104 num_migrations_crossnode: 236
num_migrations: 4001 num_migrations: 3657

CPU: F, Mem: F, NB=1, PNID=0
-----------------------------------------------------------------
num_migrations_samecore : 718 num_migrations_samecore : 737
num_migrations_samenode : 3148 num_migrations_samenode : 3274
num_migrations_crossnode: 2 num_migrations_crossnode: 7
num_migrations: 3868 num_migrations: 4018

We see that NB does have a big impact (a decrease in cross-node
migrations), confirmed by much better measured times: line {17} vs
line {8}.

In terms of the primary use case, pinned RDMA memory buffers, the
interesting results are where the CPU is allowed to float with memory
pinned {2,5,8,11,14,17,20,23,26,29,32,35}. What do the migration counts
look like for those cases:

CPU: F, Mem: 0, NB=0, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 788 num_migrations_samecore : 739
num_migrations_samenode : 2251 num_migrations_samenode : 2292
num_migrations_crossnode: 1738 num_migrations_crossnode: 1500
num_migrations: 4777 num_migrations: 4531

CPU: F, Mem: 0, NB=1, PNID=-1
-----------------------------------------------------------------
num_migrations_samecore : 663 num_migrations_samecore : 657
num_migrations_samenode : 2434 num_migrations_samenode : 2427
num_migrations_crossnode: 1344 num_migrations_crossnode: 1499
num_migrations: 4441 num_migrations: 4583

CPU: F, Mem: 0, NB=1, PNID=0
-----------------------------------------------------------------
num_migrations_samecore : 653 num_migrations_samecore : 665
num_migrations_samenode : 2954 num_migrations_samenode : 2880
num_migrations_crossnode: 7 num_migrations_crossnode: 12
num_migrations: 3614 num_migrations: 3557

From a purely NUMA perspective, accurately setting the preferred node from
user space, i.e. "Preferred Node Affinity", appears to be a substantial
win, as can be seen by comparing lines {2, 11} vs line {20} and
lines {5, 14} vs line {32}.

We also see that NB does not have nearly the same effect with the CPU
floating and the memory bound as it does when both float. The function
task_numa_work() clearly skips non-migratable VMAs. The issue with this is
that, when NB is enabled, the most important accesses of some tasks aren't
tracked, while the accesses that are tracked can lead to the wrong value
for numa_preferred_nid, and thus NB gets turned off.
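
For reference, the skip in question looks roughly like this (abridged from
task_numa_work() in kernel/sched/fair.c; the exact set of conditions and
surrounding loop varies by kernel version):

	/* Abridged: VMAs failing this test are never scanned, so no NUMA
	 * hinting faults (and hence no numa_preferred_nid input) are ever
	 * generated for them.
	 */
	if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
	    is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
		continue;
	}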

On digging into this further, there is a 2014 presentation, "Automatic
NUMA Balancing" [2], which lists support for "unmovable" memory as future
work, recognizes its value in correctly setting numa_preferred_nid, but
says it is unclear whether it is worthwhile. I am currently working on
enabling this and running such tests.

As a final note, I will have a chance to validate the effects of these changes
against the DB next month.


[1] [RFC PATCH 0/3] Scheduler Soft Affinity
https://lore.kernel.org/lkml/20190626224718.21973-1-subhra.mazumdar@xxxxxxxxxx/

[2] "Automatic NUMA Balancing",
https://events.static.linuxfound.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf