Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface

From: Li Zhijian
Date: Tue Jan 30 2024 - 04:06:03 EST


Hi Ying


I need to pick up this thread/patch again.

> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
> already. A node in a higher tier can demote to any node in the lower
> tiers. What's more need to be displayed in nodeX/demotion_nodes?


Yes, /sys/devices/virtual/memory_tiering/memory_tierN/nodelist is intended
to show the nodes in memory_tierN. But IMHO that is not enough, especially
for the preferred demotion node(s).

Currently, when a demotion occurs, the kernel first tries to select the
destination node from the preferred nodes. If the preferred nodes do not
meet the requirements, it tries all the lower memory tier nodes until it
finds a suitable demotion destination node or ultimately fails.

However, the existing interface only lists the nodes of each tier. If
administrators want to know all the possible demotion destinations for a
given node, they need to calculate them themselves:
Step 1, find the memory tier where the given node is located
Step 2, list all nodes under all its lower tiers
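
The two steps above can be sketched in userspace C roughly as follows. This
is only an illustration of the calculation an administrator has to do by
hand, not kernel code: nodeset_t and allowed_demotion_targets() are
hypothetical names, and the tiers are assumed to be passed in as bitmasks
ordered from the highest tier to the lowest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: bit n set means "node n is in the set" (up to 32 nodes). */
typedef unsigned int nodeset_t;

/*
 * tiers[] is assumed to be ordered from the highest tier to the lowest,
 * standing in for the contents of each memory_tierN/nodelist file.
 */
static nodeset_t allowed_demotion_targets(const nodeset_t *tiers,
					  size_t ntiers, int node)
{
	nodeset_t targets = 0;
	size_t i, own = ntiers;

	assert(node >= 0 && node < 32);

	/* Step 1: find the memory tier where the given node is located. */
	for (i = 0; i < ntiers; i++) {
		if (tiers[i] & (1u << node)) {
			own = i;
			break;
		}
	}
	if (own == ntiers)
		return 0;	/* node is not in any tier */

	/* Step 2: collect all nodes under all lower tiers. */
	for (i = own + 1; i < ntiers; i++)
		targets |= tiers[i];

	return targets;
}
```

For a three-tier layout of {0-1}, {2,5} and {3-4}, this yields 2-5 for
node 0 and 3-4 for node 2; nodes in the lowest tier get an empty set.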
It is even harder to work out the preferred nodes, which depend on further
factors such as distance. In the following example, 6 nodes are split
across three memory tiers.
Emulated HMAT NUMA topology:
$ numactl -H
available: 6 nodes (0-5)
node 0 cpus: 0
node 0 size: 1974 MB
node 0 free: 1767 MB
node 1 cpus: 1
node 1 size: 1694 MB
node 1 free: 1454 MB
node 2 cpus:
node 2 size: 896 MB
node 2 free: 896 MB
node 3 cpus:
node 3 size: 896 MB
node 3 free: 896 MB
node 4 cpus:
node 4 size: 896 MB
node 4 free: 896 MB
node 5 cpus:
node 5 size: 896 MB
node 5 free: 896 MB
node distances:
node   0   1   2   3   4   5
  0:  10  31  21  41  21  41
  1:  31  10  41  21  41  21
  2:  21  41  10  51  21  51
  3:  31  21  51  10  51  21
  4:  21  41  21  51  10  51
  5:  31  21  51  21  51  10
$ cat memory_tier4/nodelist
0-1
$ cat memory_tier12/nodelist
2,5
$ cat memory_tier54/nodelist
3-4
For the above topology, memory-tier builds the following demotion path for
each node:
node[0].preferred = 2
node[0].demotion_targets = 2-5
node[1].preferred = 5
node[1].demotion_targets = 2-5
node[2].preferred = 4
node[2].demotion_targets = 3-4
node[3].preferred = <empty>
node[3].demotion_targets = <empty>
node[4].preferred = <empty>
node[4].demotion_targets = <empty>
node[5].preferred = 3
node[5].demotion_targets = 3-4
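
As a rough cross-check of the preferred nodes listed above, here is a small
userspace C sketch. It assumes a simplified rule: the preferred nodes are the
nodes of the next lower tier with the smallest distance from the source node.
The kernel may weigh additional factors, so treat this as a model only.
preferred_demotion_nodes() and nodeset_t are hypothetical names; the distance
matrix and tier masks are copied from the numactl and nodelist output above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NR_NODES 6

/* Hypothetical type: bit n set means "node n is in the set". */
typedef unsigned int nodeset_t;

/* Node distances from the numactl -H output (row = from, column = to). */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 31, 21, 41, 21, 41 },
	{ 31, 10, 41, 21, 41, 21 },
	{ 21, 41, 10, 51, 21, 51 },
	{ 31, 21, 51, 10, 51, 21 },
	{ 21, 41, 21, 51, 10, 51 },
	{ 31, 21, 51, 21, 51, 10 },
};

/* Highest to lowest: memory_tier4 (0-1), memory_tier12 (2,5), memory_tier54 (3-4). */
static const nodeset_t tiers[] = { 0x03u, 0x24u, 0x18u };
#define NR_TIERS (sizeof(tiers) / sizeof(tiers[0]))

/* Assumed rule: nearest node(s) in the next lower tier; ties are all kept. */
static nodeset_t preferred_demotion_nodes(int node)
{
	nodeset_t next_tier = 0, preferred = 0;
	int best = INT_MAX;
	int target;
	size_t i;

	assert(node >= 0 && node < NR_NODES);

	/* Find the tier directly below the one containing the node. */
	for (i = 0; i + 1 < NR_TIERS; i++) {
		if (tiers[i] & (1u << node)) {
			next_tier = tiers[i + 1];
			break;
		}
	}

	/* Keep every node of that tier that shares the minimum distance. */
	for (target = 0; target < NR_NODES; target++) {
		if (!(next_tier & (1u << target)))
			continue;
		if (distance[node][target] < best) {
			best = distance[node][target];
			preferred = 1u << target;
		} else if (distance[node][target] == best) {
			preferred |= 1u << target;
		}
	}

	return preferred;
}
```

Under that assumption the sketch gives node 2 for node 0, node 5 for node 1,
node 4 for node 2 and node 3 for node 5, with an empty set for nodes 3 and 4,
which agrees with the list above.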
But this demotion path is not explicitly visible to the administrator.
Feedback from our customers also indicates that knowing the demotion path
built by the kernel helps them understand the demotion behavior.

So I think we should have 2 new interfaces for each node:

/sys/devices/system/node/nodeN/demotion_allowed_nodes
/sys/devices/system/node/nodeN/demotion_preferred_nodes

I value your opinion, and I'd like to know what you think about...


Thanks
Zhijian


On 02/11/2023 11:17, Huang, Ying wrote:
Li Zhijian <lizhijian@xxxxxxxxxxx> writes:

It shows the demotion target nodes of a node. Export this information to
the user directly.

Below is an example where node0 and node1 are DRAM and node3 is a PMEM node.
- Before PMEM is online, there are no demotion_nodes for node0 and node1.
$ cat /sys/devices/system/node/node0/demotion_nodes
<show nothing>
- After node3 is online as kmem
$ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
[
{
"chardev":"dax0.0",
"size":1054867456,
"target_node":3,
"align":2097152,
"mode":"system-ram",
"online_memblocks":0,
"total_memblocks":7
}
]
$ cat /sys/devices/system/node/node0/demotion_nodes
3
$ cat /sys/devices/system/node/node1/demotion_nodes
3
$ cat /sys/devices/system/node/node3/demotion_nodes
<show nothing>

We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
already. A node in a higher tier can demote to any node in the lower
tiers. What's more need to be displayed in nodeX/demotion_nodes?

--
Best Regards,
Huang, Ying

Signed-off-by: Li Zhijian <lizhijian@xxxxxxxxxxx>
---
drivers/base/node.c | 13 +++++++++++++
include/linux/memory-tiers.h | 6 ++++++
mm/memory-tiers.c | 8 ++++++++
3 files changed, 27 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 493d533f8375..27e8502548a7 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/memory.h>
+#include <linux/memory-tiers.h>
#include <linux/vmstat.h>
#include <linux/notifier.h>
#include <linux/node.h>
@@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
}
static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
+static ssize_t demotion_nodes_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int ret;
+ nodemask_t nmask = next_demotion_nodes(dev->id);
+
+ ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
+ return ret;
+}
+static DEVICE_ATTR_RO(demotion_nodes);
+
static struct attribute *node_dev_attrs[] = {
&dev_attr_meminfo.attr,
&dev_attr_numastat.attr,
&dev_attr_distance.attr,
&dev_attr_vmstat.attr,
+ &dev_attr_demotion_nodes.attr,
NULL
};
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 437441cdf78f..8eb04923f965 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
void clear_node_memory_type(int node, struct memory_dev_type *memtype);
#ifdef CONFIG_MIGRATION
int next_demotion_node(int node);
+nodemask_t next_demotion_nodes(int node);
void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
bool node_is_toptier(int node);
#else
@@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
return NUMA_NO_NODE;
}
+static inline nodemask_t next_demotion_nodes(int node)
+{
+ return NODE_MASK_NONE;
+}
+
static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
{
*targets = NODE_MASK_NONE;
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 37a4f59d9585..90047f37d98a 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
rcu_read_unlock();
}
+nodemask_t next_demotion_nodes(int node)
+{
+ if (!node_demotion)
+ return NODE_MASK_NONE;
+
+ return node_demotion[node].preferred;
+}
+
/**
* next_demotion_node() - Get the next node in the demotion path
* @node: The starting node to lookup the next node