Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys interface

From: Huang, Ying
Date: Wed Jan 31 2024 - 01:54:33 EST


"Yasunori Gotou (Fujitsu)" <y-goto@xxxxxxxxxxx> writes:

> Hello,
>
>> Li Zhijian <lizhijian@xxxxxxxxxxx> writes:
>>
>> > Hi Ying
>> >
>> > I need to pick up this thread/patch again.
>> >
>> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> >> already. A node in a higher tier can demote to any node in the lower
>> >> tiers. What more needs to be displayed in nodeX/demotion_nodes?
>> >>
>> >
>> > Yes, /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
>> > is intended to show the nodes in memory_tierN. But IMHO, that is not
>> > enough, especially for the preferred demotion node(s).
>> >
>> > Currently, when a demotion occurs, the kernel first tries the
>> > preferred nodes as the demotion destination. If the preferred nodes
>> > do not meet the requirements, it tries all the nodes in the lower
>> > memory tiers until it finds a suitable destination or ultimately
>> > fails.
>> >
>> > However, sysfs currently only lists the nodes of each tier. If
>> > administrators want to know all the possible demotion destinations for
>> > a given node, they need to calculate them themselves:
>> > Step 1, find the memory tier where the given node is located
>> > Step 2, list all nodes under all its lower tiers
>> >
>> > It is even more difficult to know the preferred nodes, which
>> > depend on more factors, such as distance. In the following example,
>> > 6 nodes are split into three memory tiers.
>> >
>> > Example with an emulated HMAT NUMA topology:
>> >> $ numactl -H
>> >> available: 6 nodes (0-5)
>> >> node 0 cpus: 0
>> >> node 0 size: 1974 MB
>> >> node 0 free: 1767 MB
>> >> node 1 cpus: 1
>> >> node 1 size: 1694 MB
>> >> node 1 free: 1454 MB
>> >> node 2 cpus:
>> >> node 2 size: 896 MB
>> >> node 2 free: 896 MB
>> >> node 3 cpus:
>> >> node 3 size: 896 MB
>> >> node 3 free: 896 MB
>> >> node 4 cpus:
>> >> node 4 size: 896 MB
>> >> node 4 free: 896 MB
>> >> node 5 cpus:
>> >> node 5 size: 896 MB
>> >> node 5 free: 896 MB
>> >> node distances:
>> >> node   0   1   2   3   4   5
>> >>   0:  10  31  21  41  21  41
>> >>   1:  31  10  41  21  41  21
>> >>   2:  21  41  10  51  21  51
>> >>   3:  31  21  51  10  51  21
>> >>   4:  21  41  21  51  10  51
>> >>   5:  31  21  51  21  51  10
>> >> $ cat memory_tier4/nodelist
>> >> 0-1
>> >> $ cat memory_tier12/nodelist
>> >> 2,5
>> >> $ cat memory_tier54/nodelist
>> >> 3-4
>> >
>> > For the above topology, memory-tier will build the demotion path for
>> > each node like this:
>> > node[0].preferred = 2
>> > node[0].demotion_targets = 2-5
>> > node[1].preferred = 5
>> > node[1].demotion_targets = 2-5
>> > node[2].preferred = 4
>> > node[2].demotion_targets = 3-4
>> > node[3].preferred = <empty>
>> > node[3].demotion_targets = <empty>
>> > node[4].preferred = <empty>
>> > node[4].demotion_targets = <empty>
>> > node[5].preferred = 3
>> > node[5].demotion_targets = 3-4
>> >
>> > But this demotion path is not explicitly visible to the administrator.
>> > Feedback from our customers also says it would be helpful to know the
>> > demotion path built by the kernel in order to understand the demotion
>> > behavior.
>> >
>> > So I think we should have 2 new interfaces for each node:
>> >
>> > /sys/devices/system/node/nodeN/demotion_allowed_nodes
>> > /sys/devices/system/node/nodeN/demotion_preferred_nodes
>> >
>> > I value your opinion, and I'd like to know what you think about...
>>
>> Per my understanding, we will not expose everything inside kernel to user
>> space. For page placement in a tiered memory system, demotion is just a part
>> of the story. For example, if the DRAM of a system becomes full, new page
>> allocation will fall back to the CXL memory. Have we exposed the default page
>> allocation fallback order to user space?
>
> Taken to the extreme, users want to analyze all memory management behavior
> while executing their workload, and want to trace ALL of it if possible.
> Of course, that is impossible because of the heavy load, so users want other ways as
> a compromise. Our request, the demotion target information, is just one of them.
>
> My impression is that users worry about the impact of the CXL memory device on their workload,
> and want a way to understand that impact.
> If they learn there is no information to relieve their anxiety, they may avoid buying CXL memory.
>
> In addition, our support team also needs clues to solve users' performance problems.
> Even if new page allocation falls back to the CXL memory, we need to explain why that
> happens, as a matter of accountability.
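
For what it is worth, the two-step calculation Zhijian describes (find the
node's tier, then list every node in the lower tiers) can be done in user
space from the existing nodelist files. Below is a rough Python sketch, not
an existing tool. The helper names are made up for illustration, and it
assumes that a larger memory_tierN number means a lower tier, which matches
the example topology quoted above. It does not compute the preferred
node(s), which also depend on distance.

```python
import glob
import os


def parse_nodelist(text):
    """Expand a sysfs node list such as "0-1" or "2,5" into a set of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        nodes.update(range(int(first), int(last or first) + 1))
    return nodes


def read_tiers(base="/sys/devices/virtual/memory_tiering"):
    """Map each memory_tierN number to the set of nodes in its nodelist."""
    tiers = {}
    for path in glob.glob(os.path.join(base, "memory_tier*", "nodelist")):
        name = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            tiers[int(name[len("memory_tier"):])] = parse_nodelist(f.read())
    return tiers


def allowed_targets(tiers, node):
    """Step 1: find the node's tier. Step 2: collect all lower-tier nodes.

    Assumption: a larger tier number is a lower (slower) tier.
    """
    own = next(t for t, nodes in tiers.items() if node in nodes)
    targets = set()
    for tier, nodes in tiers.items():
        if tier > own:
            targets |= nodes
    return targets


# The tiers from the example topology quoted above.
EXAMPLE_TIERS = {
    4: parse_nodelist("0-1"),
    12: parse_nodelist("2,5"),
    54: parse_nodelist("3-4"),
}

if __name__ == "__main__":
    for n in range(6):
        print(n, sorted(allowed_targets(EXAMPLE_TIERS, n)))
```

For the quoted topology this gives 2-5 for node 0 and node 1, 3-4 for node 2
and node 5, and nothing for node 3 and node 4, which agrees with the
demotion_targets listed above.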

I guess

/proc/<PID>/numa_maps
/sys/fs/cgroup/<CGNAME>/memory.numa_stat

may help to understand system behavior.
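
As a rough illustration (a sketch only, not an existing tool), the per-node
page counts in numa_maps can be summed with a few lines of Python. The
helper name pages_per_node and the sample lines are made up for the example.
The code assumes each mapping line starts with the address and the policy,
followed by fields that include N<node>=<pages>.

```python
def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over all mappings of one process."""
    totals = {}
    for line in numa_maps_text.splitlines():
        # Skip the first two fields (start address and memory policy).
        for field in line.split()[2:]:
            key, sep, value = field.partition("=")
            if sep and key[:1] == "N" and key[1:].isdigit() and value.isdigit():
                node = int(key[1:])
                totals[node] = totals.get(node, 0) + int(value)
    return totals


# Made-up sample in the numa_maps layout, for illustration only.
SAMPLE = """\
7f1c2a000000 default anon=512 dirty=512 N0=300 N2=212 kernelpagesize_kB=4
7f1c2c000000 default file=/usr/lib/libc.so.6 mapped=40 N0=40 kernelpagesize_kB=4
"""

if __name__ == "__main__":
    print(pages_per_node(SAMPLE))
```

Running it against the real file for a process at intervals would show how
many of its pages end up on the lower-tier nodes over time.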

--
Best Regards,
Huang, Ying

>>
>> All in all, in my opinion, we should expose as little as possible to user space
>> because we need to maintain the ABI forever.
>
> I understand that our proposal creates a compatibility problem, and that the kernel may
> change its logic in the future. This is a tug-of-war between kernel developers
> and users or support engineers. I suppose it occurs often in many places...
>
> Hmm... I hope there is a new idea to solve this situation even if our proposal is rejected.
> Anyone?
>
> Thanks,
> ----
> Yasunori Goto
>
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>> >
>> > On 02/11/2023 11:17, Huang, Ying wrote:
>> >> Li Zhijian <lizhijian@xxxxxxxxxxx> writes:
>> >>
>> >>> It shows the demotion target nodes of a node. Export this
>> >>> information to the user directly.
>> >>>
>> >>> Below is an example where node0 and node1 are DRAM, and node3 is a PMEM node.
>> >>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>> <show nothing>
>> >>> - After node3 is online as kmem
>> >>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 &&
>> >>> daxctl online-memory dax0.0
>> >>> [
>> >>> {
>> >>> "chardev":"dax0.0",
>> >>> "size":1054867456,
>> >>> "target_node":3,
>> >>> "align":2097152,
>> >>> "mode":"system-ram",
>> >>> "online_memblocks":0,
>> >>> "total_memblocks":7
>> >>> }
>> >>> ]
>> >>> $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>> 3
>> >>> $ cat /sys/devices/system/node/node1/demotion_nodes
>> >>> 3
>> >>> $ cat /sys/devices/system/node/node3/demotion_nodes
>> >>> <show nothing>
>> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> >> already. A node in a higher tier can demote to any node in the lower
>> >> tiers. What more needs to be displayed in nodeX/demotion_nodes?
>> >> --
>> >> Best Regards,
>> >> Huang, Ying
>> >>
>> >>> Signed-off-by: Li Zhijian <lizhijian@xxxxxxxxxxx>
>> >>> ---
>> >>> drivers/base/node.c | 13 +++++++++++++
>> >>> include/linux/memory-tiers.h | 6 ++++++
>> >>> mm/memory-tiers.c | 8 ++++++++
>> >>> 3 files changed, 27 insertions(+)
>> >>>
>> >>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> >>> index 493d533f8375..27e8502548a7 100644
>> >>> --- a/drivers/base/node.c
>> >>> +++ b/drivers/base/node.c
>> >>> @@ -7,6 +7,7 @@
>> >>> #include <linux/init.h>
>> >>> #include <linux/mm.h>
>> >>> #include <linux/memory.h>
>> >>> +#include <linux/memory-tiers.h>
>> >>> #include <linux/vmstat.h>
>> >>> #include <linux/notifier.h>
>> >>> #include <linux/node.h>
>> >>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>> >>> }
>> >>> static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>> >>> +static ssize_t demotion_nodes_show(struct device *dev,
>> >>> + struct device_attribute *attr, char *buf)
>> >>> +{
>> >>> + int ret;
>> >>> + nodemask_t nmask = next_demotion_nodes(dev->id);
>> >>> +
>> >>> + ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>> >>> + return ret;
>> >>> +}
>> >>> +static DEVICE_ATTR_RO(demotion_nodes);
>> >>> +
>> >>> static struct attribute *node_dev_attrs[] = {
>> >>> &dev_attr_meminfo.attr,
>> >>> &dev_attr_numastat.attr,
>> >>> &dev_attr_distance.attr,
>> >>> &dev_attr_vmstat.attr,
>> >>> + &dev_attr_demotion_nodes.attr,
>> >>> NULL
>> >>> };
>> >>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> >>> index 437441cdf78f..8eb04923f965 100644
>> >>> --- a/include/linux/memory-tiers.h
>> >>> +++ b/include/linux/memory-tiers.h
>> >>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>> >>> void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>> >>> #ifdef CONFIG_MIGRATION
>> >>> int next_demotion_node(int node);
>> >>> +nodemask_t next_demotion_nodes(int node);
>> >>> void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>> >>> bool node_is_toptier(int node);
>> >>> #else
>> >>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>> >>> return NUMA_NO_NODE;
>> >>> }
>> >>> +static inline nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> + return NODE_MASK_NONE;
>> >>> +}
>> >>> +
>> >>> static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>> {
>> >>> *targets = NODE_MASK_NONE;
>> >>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> >>> index 37a4f59d9585..90047f37d98a 100644
>> >>> --- a/mm/memory-tiers.c
>> >>> +++ b/mm/memory-tiers.c
>> >>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>> rcu_read_unlock();
>> >>> }
>> >>> +nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> + if (!node_demotion)
>> >>> + return NODE_MASK_NONE;
>> >>> +
>> >>> + return node_demotion[node].preferred;
>> >>> +}
>> >>> +
>> >>> /**
>> >>> * next_demotion_node() - Get the next node in the demotion path
>> >>> * @node: The starting node to lookup the next node