RE: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers

From: Srinivasulu Thanneeru
Date: Wed Jan 03 2024 - 00:26:51 EST


Micron Confidential

Hi Huang, Ying,

My apologies for the wrong mail reply format; my mail client settings got changed on my PC.
Please find my comments inline below.

Regards,
Srini


> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Monday, December 18, 2023 11:26 AM
> To: gregory.price <gregory.price@memverge.com>
> Cc: Srinivasulu Opensrc <sthanneeru.opensrc@micron.com>;
> linux-cxl@vger.kernel.org; linux-mm@kvack.org; Srinivasulu Thanneeru
> <sthanneeru@micron.com>; aneesh.kumar@linux.ibm.com;
> dan.j.williams@intel.com; mhocko@suse.com; tj@kernel.org;
> john@jagalactic.com; Eishan Mirakhur <emirakhur@micron.com>; Vinicius
> Tavares Petrucci <vtavarespetr@micron.com>; Ravis OpenSrc
> <Ravis.OpenSrc@micron.com>; Jonathan.Cameron@huawei.com;
> linux-kernel@vger.kernel.org; Johannes Weiner <hannes@cmpxchg.org>; Wei Xu
> <weixugc@google.com>
> Subject: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers
>
> Gregory Price <gregory.price@memverge.com> writes:
>
> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
> >> <sthanneeru.opensrc@micron.com> writes:
> >>
> >> > =============
> >> > Version Notes:
> >> >
> >> > V2 : Changed interface to memtier_override from adistance_offset.
> >> > memtier_override was recommended by
> >> > 1. John Groves <john@jagalactic.com>
> >> > 2. Ravi Shankar <ravis.opensrc@micron.com>
> >> > 3. Brice Goglin <Brice.Goglin@inria.fr>
> >>
> >> It appears that you ignored my comments for V1 as follows ...
> >>
> >>
> >> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/

Thank you, Huang, Ying, for pointing to this.
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

In the presentation above, the adistance_offsets are per memtype.
We believe that a per-node adistance_offset is more suitable and flexible,
since it can be changed for each node individually. If we keep adistance_offset
per memtype, then we cannot change it for a specific node of a given memtype.

> >>
> >> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/

Yes, memory_type would group related memories together as a single tier.
We should also have the flexibility to move nodes between tiers, to address the
issues described in the use cases above.

> >>
> >> https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/

This patch provides a way to move a node to the correct tier.
We observed test setups where DRAM and CXL are put under the same
tier (memory_tier4).
By using this patch, we can move the CXL node away from the DRAM-linked tier
(memory_tier4) and put it in the desired tier.
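
As a minimal sketch of how such a setup can be inspected from user space
(this is not part of the patch): the kernel lists the nodes of each tier in
/sys/devices/virtual/memory_tiering/memory_tierN/nodelist. The helper names
parse_nodelist() and read_memory_tiers() below are hypothetical, and the sysfs
root is a parameter so the parsing can be tried against a scratch directory.

```python
import os
import re

# Location of the memory tier devices in sysfs (existing kernel ABI,
# independent of this patch set).
SYSFS_TIER_ROOT = "/sys/devices/virtual/memory_tiering"


def parse_nodelist(text):
    """Expand a kernel-style node list such as "0-1,3" into [0, 1, 3]."""
    nodes = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty nodelist yields no nodes
        if "-" in part:
            first, last = part.split("-", 1)
            nodes.extend(range(int(first), int(last) + 1))
        else:
            nodes.append(int(part))
    return nodes


def read_memory_tiers(root=SYSFS_TIER_ROOT):
    """Return {tier_id: [node, ...]} for each memory_tierN directory under root."""
    tiers = {}
    for entry in os.listdir(root):
        match = re.fullmatch(r"memory_tier(\d+)", entry)
        if match is None:
            continue  # skip anything that is not a memory_tierN directory
        with open(os.path.join(root, entry, "nodelist")) as f:
            tiers[int(match.group(1))] = parse_nodelist(f.read())
    return tiers
```

On a setup like the one described above, calling read_memory_tiers() would be
expected to show the DRAM and CXL nodes together under tier 4; after moving the
CXL node, it should appear under a different tier id.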

> >>
> >
> > Not speaking for the group, just chiming in because i'd discussed it
> > with them.
> >
> > "Memory Type" is a bit nebulous. Is a Micron Type-3 with performance X
> > and an SK Hynix Type-3 with performance Y a "Different type", or are
> > they the "Same Type" given that they're both Type 3 backed by some form
> > of DDR? Is socket placement of those devices relevant for determining
> > "Type"? Is whether they are behind a switch relevant for determining
> > "Type"? "Type" is frustrating when everything we're talking about
> > managing is "Type-3" with difference performance.
> >
> > A concrete example:
> > To the system, a Multi-Headed Single Logical Device (MH-SLD) looks
> > exactly the same as an standard SLD. I may want to have some
> > combination of local memory expansion devices on the majority of my
> > expansion slots, but reserve 1 slot on each socket for a connection to
> > the MH-SLD. As of right now: There is no good way to differentiate the
> > devices in terms of "Type" - and even if you had that, the tiering
> > system would still lump them together.
> >
> > Similarly, an initial run of switches may or may not allow enumeration
> > of devices behind it (depends on the configuration), so you may end up
> > with a static numa node that "looks like" another SLD - despite it being
> > some definition of "GFAM". Do number of hops matter in determining
> > "Type"?
>
> In the original design, the memory devices of same memory type are
> managed by the same device driver, linked with system in same way
> (including switches), built with same media. So, the performance is
> same too. And, same as memory tiers, memory types are orthogonal to
> sockets. Do you think the definition itself is clear enough?
>
> I admit "memory type" is a confusing name. Do you have some better
> suggestion?
>
> > So I really don't think "Type" is useful for determining tier placement.
> >
> > As of right now, the system lumps DRAM nodes as one tier, and pretty
> > much everything else as "the other tier". To me, this patch set is an
> > initial pass meant to allow user-control over tier composition while
> > the internal mechanism is sussed out and the environment develops.
>
> The patchset to identify the performance of memory devices and put them
> in proper "memory types" and memory tiers via HMAT has been merged by
> v6.7-rc1.
>
> 07a8bdd4120c (memory tiering: add abstract distance calculation
> algorithms management, 2023-09-26)
> d0376aac59a1 (acpi, hmat: refactor hmat_register_target_initiators(),
> 2023-09-26)
> 3718c02dbd4c (acpi, hmat: calculate abstract distance with HMAT, 2023-09-26)
> 6bc2cfdf82d5 (dax, kmem: calculate abstract distance with general
> interface, 2023-09-26)
>
> > In general, a release valve that lets you redefine tiers is very welcome
> > for testing and validation of different setups while the industry evolves.
> >
> > Just my two cents.
>
> --
> Best Regards,
> Huang, Ying