Re: [PATCH] mm: migrate: Support multiple target nodes demotion

From: Baolin Wang
Date: Wed Nov 10 2021 - 05:44:36 EST




On 2021/11/10 16:51, Huang, Ying writes:
Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> writes:

We have some machines with multiple memory types like below, which
have one fast (DRAM) memory node and two slow (persistent memory) memory
nodes. According to current node demotion, if node 0 fills up,
~~~~~~~~~~~~~

node demotion policy?

Yes, will fix in the next version.



its memory should be migrated to node 1; when node 1 fills up, its
memory will be migrated to node 2: node 0 -> node 1 -> node 2 -> stop.

But this is not an efficient or suitable memory migration route
for our machine with multiple slow memory nodes. The distances from
node 0 to node 1 and from node 0 to node 2 are equal, and memory
migration between slow memory nodes greatly increases persistent memory
bandwidth usage, which hurts the whole system's performance.

Thus for this case, we can treat the slow memory node 1 and node 2
as a whole slow memory region, and we should migrate memory from
node 0 to node 1 and node 2 if node 0 fills up.

This patch changes the node_demotion data structure to support multiple
target nodes, and establishes the migration path by checking whether
each candidate target node has the best node distance.

available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 62153 MB
node 0 free: 55135 MB
node 1 cpus:
node 1 size: 127007 MB
node 1 free: 126930 MB
node 2 cpus:
node 2 size: 126968 MB
node 2 free: 126878 MB
node distances:
node 0 1 2
0: 10 20 20
1: 20 10 20
2: 20 20 10
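
As an illustration of the idea described above, here is a minimal
userspace sketch of picking every slow node at the best distance from a
fast node. It is not the code from this patch: NR_NODES, node_is_slow[]
and build_demotion_targets() are hypothetical names made up for the
example, and the distance table is copied from the numactl output above.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 3               /* nodes in the example machine */
#define DEMOTION_TARGET_NODES 15 /* same cap as in the patch */

struct demotion_nodes {
	unsigned short nr;
	short nodes[DEMOTION_TARGET_NODES];
};

/* Distance table from the numactl output above. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 20 },
	{ 20, 10, 20 },
	{ 20, 20, 10 },
};

/* Hypothetical flag: 1 for CPU-less slow (PMEM) nodes, 0 for DRAM. */
static const int node_is_slow[NR_NODES] = { 0, 1, 1 };

/*
 * Fill *dn with every slow node at the minimum distance from src.
 * Slow nodes get no targets, which ends the demotion path there.
 */
static void build_demotion_targets(int src, struct demotion_nodes *dn)
{
	int best = INT_MAX;
	int n;

	assert(src >= 0 && src < NR_NODES);
	dn->nr = 0;
	if (node_is_slow[src])
		return;

	/* Pass 1: find the best (smallest) distance to any slow node. */
	for (n = 0; n < NR_NODES; n++)
		if (node_is_slow[n] && node_distance[src][n] < best)
			best = node_distance[src][n];

	/* Pass 2: record every slow node at exactly that distance. */
	for (n = 0; n < NR_NODES && dn->nr < DEMOTION_TARGET_NODES; n++)
		if (node_is_slow[n] && node_distance[src][n] == best)
			dn->nodes[dn->nr++] = (short)n;
}
```

Calling build_demotion_targets(0, &dn) on this table yields nr=2 with
targets 1 and 2, matching the "0 -> 1/2 -> stop" path; calling it for
node 1 or node 2 yields nr=0.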

Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
---
Changes from RFC v2:
- Change to 'short' type for target nodes array.
- Remove nodemask; instead select the target node directly.
- Add WARN_ONCE() if the target nodes exceed the maximum value.

Changes from RFC v1:
- Re-define the node_demotion structure.
- Set up multiple target nodes by validating the node distance.
- Add more comments.
---
mm/migrate.c | 138 +++++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 102 insertions(+), 36 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index cf25b00..7f1d745 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
#include <linux/ptrace.h>
#include <linux/oom.h>
#include <linux/memory.h>
+#include <linux/random.h>
#include <asm/tlbflush.h>
@@ -1119,12 +1120,25 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
*
* This is represented in the node_demotion[] like this:
*
- * { 1, // Node 0 migrates to 1
- * 2, // Node 1 migrates to 2
- * -1, // Node 2 does not migrate
- * 4, // Node 3 migrates to 4
- * 5, // Node 4 migrates to 5
- * -1} // Node 5 does not migrate
+ * { nr=1, nodes[0]=1 }, // Node 0 migrates to 1
+ * { nr=1, nodes[0]=2 }, // Node 1 migrates to 2
+ * { nr=0, nodes[0]=-1 }, // Node 2 does not migrate
+ * { nr=1, nodes[0]=4 }, // Node 3 migrates to 4
+ * { nr=1, nodes[0]=5 }, // Node 4 migrates to 5
+ * { nr=0, nodes[0]=-1} // Node 5 does not migrate
+ *
+ * Moreover some systems may have multiple same class memory
+ * types. Suppose a system has one socket with 3 memory nodes,

s/same class memory types/slow memory nodes/

?

We don't support multiple fast memory types, right?

So far we have no machines with multiple fast memory types. OK, I will change the wording.


+ * node 0 is fast memory type, and node 1/2 both are slow memory
+ * type, and the distance between fast memory node and slow
+ * memory node is same. So the migration path should be:
+ *
+ * 0 -> 1/2 -> stop
+ *
+ * This is represented in the node_demotion[] like this:
+ * { nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
+ * { nr=0, nodes[0]=-1, }, // Node 1 does not migrate
+ * { nr=0, nodes[0]=-1, }, // Node 2 does not migrate
*/
/*
@@ -1135,8 +1149,13 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
* must be held over all reads to ensure that no cycles are
* observed.
*/
-static int node_demotion[MAX_NUMNODES] __read_mostly =
- {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
+#define DEMOTION_TARGET_NODES 15
+struct demotion_nodes {
+ unsigned short nr;
+ short nodes[DEMOTION_TARGET_NODES];
+};
+
+static struct demotion_nodes node_demotion[MAX_NUMNODES] __read_mostly;

If MAX_NUMNODES is 1024, the total size will be (16 * 2 * 1024) = 32K
bytes. That appears too large. We may consider allocating
node_demotion[] dynamically.

Sure. I'd like to optimize it in a separate patch to keep the current patch easy to review. Thanks.
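
For reference, the size arithmetic above and the dynamic-allocation idea
can be sketched in userspace C as follows. This is only an illustration
under assumptions: demotion_table_bytes() and alloc_demotion_table() are
hypothetical helpers, calloc() stands in for whatever kernel allocator a
follow-up patch would use, and the 32-byte struct size assumes a 2-byte
short as on common ABIs.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMOTION_TARGET_NODES 15

struct demotion_nodes {
	unsigned short nr;
	short nodes[DEMOTION_TARGET_NODES];
};

/* Bytes used by a statically sized table with max_nodes entries. */
static size_t demotion_table_bytes(size_t max_nodes)
{
	return max_nodes * sizeof(struct demotion_nodes);
}

/*
 * Hypothetical helper: allocate a zeroed table sized for the nodes
 * actually present. Returns NULL on failure; caller frees with free().
 */
static struct demotion_nodes *alloc_demotion_table(size_t nr_nodes)
{
	return calloc(nr_nodes, sizeof(struct demotion_nodes));
}
```

With 1024 possible nodes the static table takes 1024 * 32 = 32768 bytes,
while a table sized for the 3 nodes of the example machine takes 96 bytes.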