Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers

From: Aneesh Kumar K V
Date: Wed Jun 15 2022 - 02:23:39 EST


On 6/15/22 12:26 AM, Johannes Weiner wrote:

....

>> What can happen is two devices that are managed by DAX/kmem that
>> should be in two memory tiers get assigned the same memory tier
>> because the dax/kmem driver added both the device to the same memory tier.
>>
>> In the future we would avoid that by using more device properties like HMAT
>> to create additional memory tiers with different rank values. ie, we would
>> do in the dax/kmem create_tier_from_rank() .
>
> Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
> DRAMs of different speeds etc.
>
> I also like Huang's idea of using latency characteristics instead of
> abstract distances. Though I'm not quite sure how feasible this is in
> the short term, and share some concerns that Jonathan raised. But I
> think a wider possible range to begin with makes sense in any case.
>

How about the below proposal?

In this proposal, we use the tier ID as the value that determines the position
of the memory tier in the demotion order. A higher value of tier ID indicates a
higher memory tier. Memory demotion happens from a higher memory tier to a lower
memory tier.
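
To make the ordering rule concrete, here is a minimal userspace sketch (not kernel
code; the node/tier values are placeholders picked only for illustration):

#include <stdbool.h>
#include <stdio.h>

/*
 * Userspace model of the ordering rule, not kernel code.
 * The tier IDs below are made-up placeholders.
 */
static const int node_tier_id[] = {
	[0] = 200,	/* DRAM node in the default tier */
	[1] = 100,	/* slower memory in a lower tier */
};

/* Demotion is allowed only from a higher tier ID to a lower one. */
static bool can_demote(int from_node, int to_node)
{
	return node_tier_id[from_node] > node_tier_id[to_node];
}

int main(void)
{
	printf("node0 -> node1: %d\n", can_demote(0, 1));	/* 1 */
	printf("node1 -> node0: %d\n", can_demote(1, 0));	/* 0: never demote upward */
	return 0;
}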

By default, memory gets hotplugged into the 'default_memory_tier'. There is a core
kernel parameter "default_memory_tier" which can be updated if the user wants to
modify the default tier ID.

The dax/kmem driver uses the "dax_kmem_memtier" module parameter to determine the
memory tier to which DAX/kmem memory will be added.

dax_kmem_memtier and default_memory_tier default to 100 and 200 respectively.

Later, as we update dax/kmem to use additional device attributes, the driver will
be able to place new devices in different memory tiers. As we do that, it is
expected that users will have the ability to override these device attributes and
control which memory tiers the devices will be placed in.

New memory tiers can also be created by using the node/memtier attribute.
Moving a NUMA node to a non-existing memory tier results in the creation of
a new memory tier. So if the kernel's default placement of memory devices
in memory tiers is not preferred, userspace can choose to create a
completely new memory tier hierarchy using this interface. Memory tiers
get deleted when they end up with an empty nodelist.
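
A rough userspace model of the intended node/memtier semantics (the data
structures and helper names below are invented purely for illustration; the real
implementation lives in the kernel and must handle locking, validation, etc.):

#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 8
#define NO_TIER   (-1)

/*
 * node_to_tier[n] is the tier ID node n currently belongs to.
 * Initial values match the "memtier200 nodelist 1-3" state in the
 * session below.
 */
static int node_to_tier[MAX_NODES] = {
	NO_TIER, 200, 200, 200, NO_TIER, NO_TIER, NO_TIER, NO_TIER
};

static bool tier_has_nodes(int tier)
{
	for (int n = 0; n < MAX_NODES; n++)
		if (node_to_tier[n] == tier)
			return true;
	return false;
}

/*
 * Model of "echo <tier> > /sys/devices/system/node/nodeN/memtier":
 * moving a node to a tier that does not exist yet implicitly creates
 * it, and a tier whose nodelist becomes empty is removed.
 */
static void node_set_memtier(int node, int tier)
{
	int old = node_to_tier[node];

	if (!tier_has_nodes(tier))
		printf("creating memtier%d\n", tier);

	node_to_tier[node] = tier;

	if (old != NO_TIER && !tier_has_nodes(old))
		printf("removing empty memtier%d\n", old);
}

int main(void)
{
	node_set_memtier(1, 20);	/* creates memtier20                    */
	node_set_memtier(1, 200);	/* memtier20 becomes empty and goes away */
	return 0;
}

Running it prints the same create/remove events as the sysfs sessions below.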

# cat /sys/module/kernel/parameters/default_memory_tier
200
# cat /sys/module/kmem/parameters/dax_kmem_memtier
100

# ls /sys/devices/system/memtier/
default_tier max_tier memtier200 power uevent
# ls /sys/devices/system/memtier/memtier200/nodelist
/sys/devices/system/memtier/memtier200/nodelist
# cat /sys/devices/system/memtier/memtier200/nodelist
1-3
# echo 20 > /sys/devices/system/node/node1/memtier
#
# ls /sys/devices/system/memtier/
default_tier max_tier memtier20 memtier200 power uevent
# cat /sys/devices/system/memtier/memtier20/nodelist
1
#

# echo 10 > /sys/module/kmem/parameters/dax_kmem_memtier
# echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
# echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
#
# ls /sys/devices/system/memtier/
default_tier max_tier memtier10 memtier20 memtier200 power uevent
# cat /sys/devices/system/memtier/memtier10/nodelist
4
#

# grep . /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier10/nodelist:4
/sys/devices/system/memtier/memtier200/nodelist:2-3
/sys/devices/system/memtier/memtier20/nodelist:1

The demotion order details for the above will be:
lower tier mask for node 1 is 4 and preferred demotion node is 4
lower tier mask for node 2 is 1,4 and preferred demotion node is 1
lower tier mask for node 3 is 1,4 and preferred demotion node is 1
lower tier mask for node 4 is None (no demotion target)
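
Those demotion targets can be derived mechanically from the tier IDs plus NUMA
distances: the lower tier mask of a node is every node sitting in a tier with a
smaller ID, and the preferred demotion node is the closest member of that mask.
A small userspace sketch (the distance table is made up only so the output
matches the example; the tier assignments follow the grep output above):

#include <stdio.h>

#define NNODES 4

/* Nodes 1-4 from the example, remapped to indices 0-3. */
static const int node_id[NNODES] = { 1, 2, 3, 4 };
static const int tier_id[NNODES] = { 20, 200, 200, 10 };

/* Invented distances, chosen only so node 1 beats node 4 as a target. */
static const int distance[NNODES][NNODES] = {
	{ 10, 20, 20, 30 },
	{ 20, 10, 20, 40 },
	{ 20, 20, 10, 40 },
	{ 30, 40, 40, 10 },
};

int main(void)
{
	for (int i = 0; i < NNODES; i++) {
		int best = -1;

		printf("node %d: lower tier mask {", node_id[i]);
		for (int j = 0; j < NNODES; j++) {
			if (tier_id[j] >= tier_id[i])
				continue;
			printf(" %d", node_id[j]);
			if (best < 0 || distance[i][j] < distance[i][best])
				best = j;
		}
		if (best >= 0)
			printf(" }, preferred demotion node %d\n", node_id[best]);
		else
			printf(" }, no demotion target\n");
	}
	return 0;
}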

:/sys/devices/system/memtier# ls
default_tier max_tier memtier10 memtier20 memtier200 power uevent
:/sys/devices/system/memtier# cat memtier20/nodelist
1
:/sys/devices/system/memtier# echo 200 > ../node/node1/memtier
:/sys/devices/system/memtier# ls
default_tier max_tier memtier10 memtier200 power uevent
:/sys/devices/system/memtier#




>>> In the other email I had suggested the ability to override not just
>>> the per-device distance, but also the driver default for new devices
>>> to handle the hotplug situation.
>>>

.....

>>
>> Can you elaborate more on how the distance value will be used? The device/device NUMA node can have
>> a different distance value from other NUMA nodes. How do we group them?
>> For ex: earlier discussion did outline three different topologies. Can you
>> elaborate how we would end up grouping them using distance?
>>
>> For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Node1,
>> so how will we classify node 2?
>>
>>
>> Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>>
>>                    20
>>   Node 0 (DRAM)  ----  Node 1 (DRAM)
>>        |        \    /      |
>>        | 30    40  X  40    | 30
>>        |        /    \      |
>>   Node 2 (PMEM)  ----  Node 3 (PMEM)
>>                    40
>>
>> node distances:
>> node     0    1    2    3
>>    0    10   20   30   40
>>    1    20   10   40   30
>>    2    30   40   10   40
>>    3    40   30   40   10
>
> I'm fairly confused by this example. Do all nodes have CPUs? Isn't
> this just classic NUMA, where optimizing for locality makes the most
> sense, rather than tiering?
>

Node 2 and Node 3 will be memory-only NUMA nodes.

> Forget the interface for a second, I have no idea how tiering on such
> a system would work. One CPU's lower tier can be another CPU's
> toptier. There is no lowest rung from which to actually *reclaim*
> pages. Would the CPUs just demote in circles?
>
> And the coldest pages on one socket would get demoted into another
> socket and displace what that socket considers hot local memory?
>
> I feel like I'm missing something.
>
> When we're talking about tiered memory, I'm thinking about CPUs
> utilizing more than one memory node. If those other nodes have CPUs,
> you can't reliably establish a singular tier order anymore and it
> becomes classic NUMA, no?