Re: [RFC PATCH] mm: support large folio numa balancing

From: Baolin Wang
Date: Mon Nov 20 2023 - 03:01:39 EST




On 11/17/2023 6:07 PM, Mel Gorman wrote:
On Wed, Nov 15, 2023 at 10:58:32AM +0800, Huang, Ying wrote:
Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx> writes:

On 11/14/2023 9:12 AM, Huang, Ying wrote:
David Hildenbrand <david@xxxxxxxxxx> writes:

On 13.11.23 11:45, Baolin Wang wrote:
Currently, file pages already support large folios, and support for
anonymous pages is also under discussion[1]. Moreover, the NUMA balancing
code has been converted to use folios by a previous thread[2], and the
migrate_pages function also already supports large folio migration.
So I do not see any reason to continue restricting NUMA balancing for
large folios.

I recall John wanted to look into that. CCing him.

I'll note that the "head page mapcount" heuristic to detect sharers will
now strike on the PTE path and make us believe that a large folio is
exclusive, although it isn't.
Even a 4k folio may be shared by multiple processes/threads. So, NUMA
balancing uses a multi-stage node selection algorithm (mostly
implemented in should_numa_migrate_memory()) to identify shared folios.
I think that the algorithm needs to be adjusted for PTE-mapped large
folios that are shared.

Not sure I get you here. should_numa_migrate_memory() uses the
last CPU id, last PID and group NUMA faults to determine if this page
can be migrated to the target node. So for a large folio, a precise
folio sharers check can make the NUMA faults of a group more accurate,
which is enough for should_numa_migrate_memory() to make a decision?

A large folio that is mapped by multiple processes may be accessed by one
remote NUMA node, so we still want to migrate it. A large folio that is
mapped by one process but accessed by multiple threads on multiple NUMA
nodes may not be migrated.


This leads into a generic problem with large anything with NUMA
balancing -- false sharing. As it stands, THP can be false shared by
threads if thread-local data is split within a THP range. In this case,
the ideal would be the THP is migrated to the hottest node but such
support doesn't exist. The same applies for folios. If not handled

So the check below in should_numa_migrate_memory() cannot avoid the false sharing of large folios that you mentioned? Please correct me if I missed anything.

	/*
	 * Destination node is much more heavily used than the source
	 * node? Allow migration.
	 */
	if (group_faults_cpu(ng, dst_nid) > group_faults_cpu(ng, src_nid) *
					    ACTIVE_NODE_FRACTION)
		return true;

	/*
	 * Distribute memory according to CPU & memory use on each node,
	 * with 3/4 hysteresis to avoid unnecessary memory migrations:
	 *
	 * faults_cpu(dst)   3   faults_cpu(src)
	 * --------------- * - > ---------------
	 * faults_mem(dst)   4   faults_mem(src)
	 */
	return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 >
	       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
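
To make the arithmetic of the check quoted above easier to follow, here is a rough userspace sketch, not kernel code. struct node_faults, sketch_should_migrate() and sketch_demo() are hypothetical names for illustration, the fault counts are made up, and SKETCH_ACTIVE_NODE_FRACTION is assumed to mirror ACTIVE_NODE_FRACTION in kernel/sched/fair.c. The counters stand in for what group_faults_cpu() and group_faults() would return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-node counters standing in for the kernel statistics. */
struct node_faults {
	unsigned long cpu;	/* stand-in for group_faults_cpu(ng, nid) */
	unsigned long mem;	/* stand-in for group_faults(p, nid) */
};

/* Assumption: same value as ACTIVE_NODE_FRACTION in the kernel. */
#define SKETCH_ACTIVE_NODE_FRACTION 3

/* Same two comparisons as the quoted snippet, on plain integers. */
bool sketch_should_migrate(struct node_faults src, struct node_faults dst)
{
	/* Destination node is much more heavily used than the source node. */
	if (dst.cpu > src.cpu * SKETCH_ACTIVE_NODE_FRACTION)
		return true;

	/*
	 * cpu(dst) / mem(dst) * 3/4 > cpu(src) / mem(src),
	 * cross-multiplied to stay in integer arithmetic.
	 */
	return dst.cpu * src.mem * 3 > src.cpu * dst.mem * 4;
}

/* A few made-up scenarios, checked with asserts. */
void sketch_demo(void)
{
	struct node_faults even = { .cpu = 100, .mem = 100 };
	struct node_faults hot  = { .cpu = 400, .mem = 100 };
	struct node_faults lean = { .cpu = 150, .mem = 50 };
	struct node_faults fat  = { .cpu = 100, .mem = 200 };

	/* Two equally active nodes: refused in both directions. */
	assert(!sketch_should_migrate(even, even));

	/* Destination has more than 3x the CPU faults: allowed. */
	assert(sketch_should_migrate(even, hot));

	/* 150 * 200 * 3 = 90000 > 100 * 50 * 4 = 20000: allowed. */
	assert(sketch_should_migrate(fat, lean));

	/* Reverse direction: 100 * 50 * 3 = 15000 > 150 * 200 * 4 is false. */
	assert(!sketch_should_migrate(lean, fat));
}
```

With these made-up numbers, the symmetric case is refused in both directions, so the hysteresis alone does not ping-pong there. Note that the counters are per group and per node, not per folio, so this sketch says nothing about how accesses to different parts of one large folio are accounted.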


properly, a large folio of any type can ping-pong between nodes so just
migrating because we can is not necessarily a good idea. The patch
should cover a realistic case why this matters, why splitting the folio
is not better and supporting data.

Sure. For a private mapping, we should always migrate the large folio. The tricky part is the shared mapping, as you and Ying said, which can have different scenarios, and I'm thinking about how to validate it. Do you have any suggestions? Thanks.