On Sat, Apr 23, 2022 at 8:02 PM ying.huang@xxxxxxxxx
<ying.huang@xxxxxxxxx> wrote:
2. For machines with PMEM installed in only 1 of 2 sockets, for example,
nodes 0 and 2 are CPU + DRAM nodes and node 1 is a slow
memory node near node 0,
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus:
node 1 size: n MB
node 1 free: n MB
node 2 cpus: 2 3
node 2 size: n MB
node 2 free: n MB
node distances:
node   0   1   2
  0:  10  40  20
  1:  40  10  80
  2:  20  80  10
We have 2 choices,
a)
node    demotion targets
0       1
2       1
b)
node    demotion targets
0       1
2       X
a) is good to take advantage of PMEM. b) is good to reduce cross-socket
traffic. Both are OK as the default configuration, but some users may
prefer the other one. So we need a user space ABI to override the
default configuration.
I think 2(a) should be the system-wide configuration and 2(b) can be
achieved with NUMA mempolicy (which needs to be added to demotion).
In general, we can view the demotion order in a way similar to
allocation fallback order (after all, if we don't demote or demotion
lags behind, the allocations will go to these demotion target nodes
according to the allocation fallback order anyway). If we initialize
the demotion order in that way (i.e. every node can demote to any node
in the next tier, and the priority of the target nodes is sorted for
each source node), we don't need per-node demotion order override from
the userspace. What we need is to specify what nodes should be in
each tier and support NUMA mempolicy in demotion.
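
To make the idea concrete, here is a rough sketch (plain Python, not kernel
code) of deriving the demotion order the same way as an allocation fallback
order: each node may demote to any node in the next tier, nearest first. The
distance matrix is the one from the example above; the TIERS assignment and
the build_demotion_order() helper are made up for illustration only.

```python
# Distance matrix from the example above (rows/columns are nodes 0, 1, 2).
DISTANCE = [
    [10, 40, 20],
    [40, 10, 80],
    [20, 80, 10],
]

# Hypothetical tier assignment: tier 0 = CPU + DRAM nodes, tier 1 = PMEM node.
TIERS = [[0, 2], [1]]


def build_demotion_order(distance, tiers):
    """Map each node to all nodes of the next tier, sorted nearest first."""
    order = {}
    for level, nodes in enumerate(tiers):
        # Nodes in the last tier have nowhere to demote to.
        next_tier = tiers[level + 1] if level + 1 < len(tiers) else []
        for src in nodes:
            # Sort by distance from src; break ties by node id so the
            # result is deterministic.
            order[src] = sorted(
                next_tier, key=lambda dst: (distance[src][dst], dst)
            )
    return order


demotion_order = build_demotion_order(DISTANCE, TIERS)
print(demotion_order)
```

With the distances above, nodes 0 and 2 both end up with node 1 as their only
target, which is choice 2(a). Nothing here is per-node configuration: the only
inputs are the tier membership and the distance table.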
Cross-socket demotion should not be too big a problem in practice
because we can optimize the code to do the demotion from the local CPU
node (i.e. local writes to the target node and remote reads from the
source node). The bigger issue is cross-socket memory access onto the
demoted pages from the applications, which is why NUMA mempolicy is
important here.
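
As a rough illustration of how a mempolicy could give 2(b) on top of a
system-wide 2(a) order, the sketch below filters the demotion targets by the
set of nodes a policy allows. This is plain Python, not kernel code; demotion
does not honor mempolicy today, and SYSTEM_ORDER, demotion_targets() and the
policy_nodes argument are all hypothetical names.

```python
# System-wide demotion order matching choice 2(a) from the example.
SYSTEM_ORDER = {0: [1], 2: [1], 1: []}


def demotion_targets(src, demotion_order, policy_nodes=None):
    """Return src's demotion targets, restricted to policy_nodes if given.

    policy_nodes stands in for the node set of a (hypothetical) NUMA
    mempolicy that demotion would honor; None means no policy is set.
    """
    targets = demotion_order.get(src, [])
    if policy_nodes is None:
        return list(targets)
    # Keep the priority order, dropping nodes the policy does not allow.
    return [node for node in targets if node in policy_nodes]


# No policy: pages on node 2 may be demoted cross-socket to node 1 (2a).
print(demotion_targets(2, SYSTEM_ORDER))

# Policy limited to node 2: no demotion target is left for node 2 (2b).
print(demotion_targets(2, SYSTEM_ORDER, policy_nodes={2}))
```

An application that cares about cross-socket access to its demoted pages
would set such a policy for itself, and everything else keeps the default.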