Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface

From: Huang, Ying
Date: Thu Nov 02 2023 - 01:21:23 EST


"Zhijian Li (Fujitsu)" <lizhijian@xxxxxxxxxxx> writes:

>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> already. A node in a higher tier can demote to any node in the lower
>> tiers. What more needs to be displayed in nodeX/demotion_nodes?
>
> IIRC, they are not the same. In memory_tier[number], the number is shared by all
> memory that uses the same memory driver (dax/kmem etc.). It does not reflect the actual
> distance between nodes (nodes at different distances are grouped into the same memory_tier),
> but demotion will only select the nearest nodes as demotion targets.

In the following patchset, we will use the performance information from
HMAT to place nodes using the same memory driver into different memory
tiers.

https://lore.kernel.org/all/20230926060628.265989-1-ying.huang@xxxxxxxxx/

The patchset is in the mm-stable tree.

> Below is an example: node0 and node1 are DRAM, node2 and node3 are PMEM, but their
> distances to the DRAM nodes differ.
>
> # numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0
> node 0 size: 964 MB
> node 0 free: 746 MB
> node 1 cpus: 1
> node 1 size: 685 MB
> node 1 free: 455 MB
> node 2 cpus:
> node 2 size: 896 MB
> node 2 free: 897 MB
> node 3 cpus:
> node 3 size: 896 MB
> node 3 free: 896 MB
> node distances:
> node   0   1   2   3
>   0:  10  20  20  25
>   1:  20  10  25  20
>   2:  20  25  10  20
>   3:  25  20  20  10
> # cat /sys/devices/system/node/node0/demotion_nodes
> 2

Node 2 is only the preferred demotion target. In fact, memory in node 0
can be demoted to either node 2 or node 3. Please check
demote_folio_list() for details.
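
To make the preferred/allowed distinction concrete, here is a small
user-space C sketch. It is not kernel code and does not reproduce
demote_folio_list(); it only models the idea using the distance table
from the numactl output quoted above. The names NR_NODES,
lower_tier_mask, allowed_targets() and preferred_targets() are
hypothetical, and the sketch assumes nodes 0-1 (DRAM) form the top tier
and nodes 2-3 (PMEM) form a single lower tier.

```c
#include <stdio.h>

#define NR_NODES 4

/* Distance table copied from the numactl -H output in this thread. */
static const int distance[NR_NODES][NR_NODES] = {
	{10, 20, 20, 25},
	{20, 10, 25, 20},
	{20, 25, 10, 20},
	{25, 20, 20, 10},
};

/* Assumption: nodes 2 and 3 (PMEM) sit together in one lower tier. */
static const unsigned int lower_tier_mask = (1u << 2) | (1u << 3);

/* Allowed targets: every node in a lower tier than @node, as a bitmask. */
unsigned int allowed_targets(int node)
{
	/* Nodes already in the lowest tier have nowhere to demote to. */
	if (lower_tier_mask & (1u << node))
		return 0;
	return lower_tier_mask;
}

/* Preferred targets: the allowed nodes at minimum distance from @node. */
unsigned int preferred_targets(int node)
{
	unsigned int allowed = allowed_targets(node);
	unsigned int best = 0;
	int min = -1;

	for (int t = 0; t < NR_NODES; t++) {
		if (!(allowed & (1u << t)))
			continue;
		if (min < 0 || distance[node][t] < min) {
			/* Strictly nearer node: restart the preferred set. */
			min = distance[node][t];
			best = 1u << t;
		} else if (distance[node][t] == min) {
			/* Equally near node: add it to the preferred set. */
			best |= 1u << t;
		}
	}
	return best;
}
```

With this table, preferred_targets(0) has only bit 2 set and
preferred_targets(1) has only bit 3 set, which matches the
demotion_nodes values shown in the quoted output, while
allowed_targets(0) contains both node 2 and node 3.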

--
Best Regards,
Huang, Ying

> # cat /sys/devices/system/node/node1/demotion_nodes
> 3
> # cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
> 2-3
>
> Thanks
> Zhijian
>
> (I hate the outlook native reply composition format.)
> ________________________________________
> From: Huang, Ying <ying.huang@xxxxxxxxx>
> Sent: Thursday, November 2, 2023 11:17
> To: Li, Zhijian/李 智坚
> Cc: Andrew Morton; Greg Kroah-Hartman; rafael@xxxxxxxxxx; linux-mm@xxxxxxxxx; Gotou, Yasunori/五島 康文; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface
>
> Li Zhijian <lizhijian@xxxxxxxxxxx> writes:
>
>> It shows the demotion target nodes of a node. Export this information
>> directly to users.
>>
>> Below is an example where node0 and node1 are DRAM and node3 is a PMEM node.
>> - Before PMEM is online, demotion_nodes is empty for node0 and node1:
>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> <shows nothing>
>> - After node3 is brought online as kmem:
>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>> [
>>   {
>>     "chardev":"dax0.0",
>>     "size":1054867456,
>>     "target_node":3,
>>     "align":2097152,
>>     "mode":"system-ram",
>>     "online_memblocks":0,
>>     "total_memblocks":7
>>   }
>> ]
>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> 3
>> $ cat /sys/devices/system/node/node1/demotion_nodes
>> 3
>> $ cat /sys/devices/system/node/node3/demotion_nodes
>> <show nothing>
>
> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
> already. A node in a higher tier can demote to any node in the lower
> tiers. What more needs to be displayed in nodeX/demotion_nodes?
>
> --
> Best Regards,
> Huang, Ying
>
>> Signed-off-by: Li Zhijian <lizhijian@xxxxxxxxxxx>
>> ---
>> drivers/base/node.c | 13 +++++++++++++
>> include/linux/memory-tiers.h | 6 ++++++
>> mm/memory-tiers.c | 8 ++++++++
>> 3 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 493d533f8375..27e8502548a7 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -7,6 +7,7 @@
>> #include <linux/init.h>
>> #include <linux/mm.h>
>> #include <linux/memory.h>
>> +#include <linux/memory-tiers.h>
>> #include <linux/vmstat.h>
>> #include <linux/notifier.h>
>> #include <linux/node.h>
>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>> }
>> static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>
>> +static ssize_t demotion_nodes_show(struct device *dev,
>> +				   struct device_attribute *attr, char *buf)
>> +{
>> +	int ret;
>> +	nodemask_t nmask = next_demotion_nodes(dev->id);
>> +
>> +	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>> +	return ret;
>> +}
>> +static DEVICE_ATTR_RO(demotion_nodes);
>> +
>> static struct attribute *node_dev_attrs[] = {
>> 	&dev_attr_meminfo.attr,
>> 	&dev_attr_numastat.attr,
>> 	&dev_attr_distance.attr,
>> 	&dev_attr_vmstat.attr,
>> +	&dev_attr_demotion_nodes.attr,
>> 	NULL
>> };
>>
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> index 437441cdf78f..8eb04923f965 100644
>> --- a/include/linux/memory-tiers.h
>> +++ b/include/linux/memory-tiers.h
>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>> void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>> #ifdef CONFIG_MIGRATION
>> int next_demotion_node(int node);
>> +nodemask_t next_demotion_nodes(int node);
>> void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>> bool node_is_toptier(int node);
>> #else
>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>> 	return NUMA_NO_NODE;
>> }
>>
>> +static inline nodemask_t next_demotion_nodes(int node)
>> +{
>> +	return NODE_MASK_NONE;
>> +}
>> +
>> static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> {
>> 	*targets = NODE_MASK_NONE;
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index 37a4f59d9585..90047f37d98a 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> 	rcu_read_unlock();
>> }
>>
>> +nodemask_t next_demotion_nodes(int node)
>> +{
>> +	if (!node_demotion)
>> +		return NODE_MASK_NONE;
>> +
>> +	return node_demotion[node].preferred;
>> +}
>> +
>> /**
>> * next_demotion_node() - Get the next node in the demotion path
>> * @node: The starting node to lookup the next node