Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

From: Huang, Ying
Date: Tue Jan 23 2024 - 03:42:25 EST


Gregory Price <gourry.memverge@xxxxxxxxx> writes:

> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> using the current MPOL_INTERLEAVE could be a wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
> based interleave policy does not optimally distribute data to make use
> of their different bandwidth characteristics.
>
> Instead, interleave is more effective when the allocation policy follows
> each NUMA node's bandwidth weight rather than a simple 1:1 distribution.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
> enabling weighted interleave between NUMA nodes. Weighted interleave
> allows for proportional distribution of memory across multiple numa
> nodes, preferably apportioned to match the bandwidth of each node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
> weight distribution is (2:1).
>
> Weights for each node can be assigned via the new sysfs extension:
> /sys/kernel/mm/mempolicy/weighted_interleave/
>
> For now, the default value of all nodes will be `1`, which matches
> the behavior of standard 1:1 round-robin interleave. An extension
> will be added in the future to allow default values to be registered
> at kernel and device bringup time.
>
> The policy allocates a number of pages equal to the set weights. For
> example, if the weights are (2,1), then 2 pages will be allocated on
> node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
>
> There are 3 integration points:
>
> weighted_interleave_nodes:
> Counts the number of allocations as they occur, and applies the
> weight for the current node. When the weight reaches 0, switch
> to the next node.
>
> weighted_interleave_nid:
> Gets the total weight of the nodemask as well as each individual
> node weight, then calculates the node based on the given index.
>
> bulk_array_weighted_interleave:
> Gets the total weight of the nodemask as well as each individual
> node weight, then calculates the number of "interleave rounds" as
> well as any delta ("partial round"). Calculates the number of
> pages for each node and allocates them.
>
> If a node was scheduled for interleave via interleave_nodes, the
> current weight (pol->cur_weight) will be allocated first, before
> the remaining bulk calculation is done.
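
To make the arithmetic described for weighted_interleave_nid and
bulk_array_weighted_interleave concrete, here is a rough userspace model.
It is only a sketch, not the code from the patch: wi_node_for_index(),
wi_bulk_split() and the plain weight array are hypothetical names, a weight
of 0 stands in for "node not in the nodemask", and the carry-over of
pol->cur_weight from a previous allocation is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: map a zero-based interleave index to a node.
 * weights[nid] == 0 means the node is not part of the nodemask.
 * Returns -1 if no node has a non-zero weight.
 */
int wi_node_for_index(const unsigned char *weights, size_t nnodes,
		      unsigned long ilx)
{
	unsigned long total = 0, target;
	size_t nid;

	assert(weights);
	for (nid = 0; nid < nnodes; nid++)
		total += weights[nid];
	if (total == 0)
		return -1;

	/* Position of this index inside one full round of weights. */
	target = ilx % total;
	for (nid = 0; nid < nnodes; nid++) {
		if (target < weights[nid])
			return (int)nid;
		target -= weights[nid];
	}
	return -1; /* not reached when total > 0 */
}

/*
 * Hypothetical model: split nr_pages into full "interleave rounds"
 * plus one partial round (the delta), starting from a fresh round.
 */
void wi_bulk_split(const unsigned char *weights, size_t nnodes,
		   unsigned long nr_pages, unsigned long *per_node)
{
	unsigned long total = 0, rounds, delta;
	size_t nid;

	assert(weights && per_node);
	for (nid = 0; nid < nnodes; nid++) {
		total += weights[nid];
		per_node[nid] = 0;
	}
	if (total == 0)
		return;

	rounds = nr_pages / total;	/* complete passes over all nodes */
	delta = nr_pages % total;	/* pages left for the partial round */

	for (nid = 0; nid < nnodes; nid++) {
		/* The partial round hands out at most one weight per node. */
		unsigned long extra = delta < weights[nid] ? delta : weights[nid];

		per_node[nid] = rounds * weights[nid] + extra;
		delta -= extra;
	}
}
```

With weights (2,1), indices 0 and 1 map to node0 and index 2 maps to node1
in this model, and a bulk request for 7 pages splits into two full rounds
plus a one-page partial round.
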
>
> One piece of complexity is the interaction between a recent refactor
> which split the logic to acquire the "ilx" (interleave index) of an
> allocation and the actual application of the interleave. The
> calculation of the `interleave index` is done by `get_vma_policy()`,
> while the actual selection of the node is applied later by the
> relevant weighted_interleave function.
>
> Suggested-by: Hasan Al Maruf <Hasan.Maruf@xxxxxxx>
> Signed-off-by: Gregory Price <gregory.price@xxxxxxxxxxxx>
> Co-developed-by: Rakie Kim <rakie.kim@xxxxxx>
> Signed-off-by: Rakie Kim <rakie.kim@xxxxxx>
> Co-developed-by: Honggyu Kim <honggyu.kim@xxxxxx>
> Signed-off-by: Honggyu Kim <honggyu.kim@xxxxxx>
> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@xxxxxx>
> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@xxxxxx>
> Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@xxxxxxxxxx>
> Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@xxxxxxxxxx>
> Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxxx>
> Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxxx>
> ---
> .../admin-guide/mm/numa_memory_policy.rst | 9 +
> include/linux/mempolicy.h | 5 +
> include/uapi/linux/mempolicy.h | 1 +
> mm/mempolicy.c | 234 +++++++++++++++++-
> 4 files changed, 246 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..a70f20ce1ffb 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
> can fall back to all existing numa nodes. This is effectively
> MPOL_PREFERRED allowed for a mask rather than a single node.
>
> +MPOL_WEIGHTED_INTERLEAVE
> + This mode operates the same as MPOL_INTERLEAVE, except that
> + interleaving behavior is executed based on weights set in
> + /sys/kernel/mm/mempolicy/weighted_interleave/
> +
> + Weighted interleave allocates pages on nodes according to a
> + weight. For example if nodes [0,1] are weighted [5,2], 5 pages
> + will be allocated on node0 for every 2 pages allocated on node1.
> +
> NUMA memory policy supports the following optional mode flags:
>
> MPOL_F_STATIC_NODES
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 931b118336f4..c1a083eb0dd5 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -54,6 +54,11 @@ struct mempolicy {
> nodemask_t cpuset_mems_allowed; /* relative to these nodes */
> nodemask_t user_nodemask; /* nodemask passed by user */
> } w;
> +
> + /* Weighted interleave settings */
> + struct {
> + u8 cur_weight;
> + } wil;
> };
>
> /*
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..1f9bb10d1a47 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,6 +23,7 @@ enum {
> MPOL_INTERLEAVE,
> MPOL_LOCAL,
> MPOL_PREFERRED_MANY,
> + MPOL_WEIGHTED_INTERLEAVE,
> MPOL_MAX, /* always last member of enum */
> };
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 427bddf115df..aa3b2389d3e0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -19,6 +19,13 @@
> * for anonymous memory. For process policy an process counter
> * is used.
> *
> + * weighted interleave
> + * Allocate memory interleaved over a set of nodes based on
> + * a set of weights (per-node), with normal fallback if it
> + * fails. Otherwise operates the same as interleave.
> + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
> + * on node 0 for every 1 page allocated on node 1.
> + *
> * bind Only allocate memory on a specific set of nodes,
> * no fallback.
> * FIXME: memory is allocated starting with the first node
> @@ -313,6 +320,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
> policy->mode = mode;
> policy->flags = flags;
> policy->home_node = NUMA_NO_NODE;
> + policy->wil.cur_weight = 0;
>
> return policy;
> }
> @@ -425,6 +433,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> .create = mpol_new_nodemask,
> .rebind = mpol_rebind_preferred,
> },
> + [MPOL_WEIGHTED_INTERLEAVE] = {
> + .create = mpol_new_nodemask,
> + .rebind = mpol_rebind_nodemask,
> + },
> };
>
> static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> @@ -846,7 +858,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>
> old = current->mempolicy;
> current->mempolicy = new;
> - if (new && new->mode == MPOL_INTERLEAVE)
> + if (new && (new->mode == MPOL_INTERLEAVE ||
> + new->mode == MPOL_WEIGHTED_INTERLEAVE))
> current->il_prev = MAX_NUMNODES-1;
> task_unlock(current);
> mpol_put(old);
> @@ -872,6 +885,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
> case MPOL_INTERLEAVE:
> case MPOL_PREFERRED:
> case MPOL_PREFERRED_MANY:
> + case MPOL_WEIGHTED_INTERLEAVE:
> *nodes = pol->nodes;
> break;
> case MPOL_LOCAL:
> @@ -956,6 +970,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> } else if (pol == current->mempolicy &&
> pol->mode == MPOL_INTERLEAVE) {
> *policy = next_node_in(current->il_prev, pol->nodes);
> + } else if (pol == current->mempolicy &&
> + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> + if (pol->wil.cur_weight)
> + *policy = current->il_prev;
> + else
> + *policy = next_node_in(current->il_prev,
> + pol->nodes);

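Per my understanding, we should always use "*policy = next_node_in()"
here, as in weighted_interleave_nodes(). That function always allocates
from next_node_in(me->il_prev, ...) and only updates il_prev once
cur_weight has dropped to 0, so while cur_weight is non-zero the next
allocation still comes from next_node_in(), not from il_prev.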
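
To illustrate the point, here is a small userspace model of the quoted
weighted_interleave_nodes() logic. It is only a sketch under stated
assumptions: struct wi_state, model_alloc(), report_quoted() and
report_suggested() are hypothetical names, the nodemask is assumed to
contain exactly nodes 0 and 1, and the RCU-protected iw_table is replaced
by a plain array of non-zero weights.

```c
#include <assert.h>

#define NR_NODES 2	/* hypothetical: nodes 0 and 1, both in the nodemask */

struct wi_state {
	unsigned int il_prev;		/* models current->il_prev */
	unsigned char cur_weight;	/* models pol->wil.cur_weight */
};

/* Models next_node_in() for a mask that contains every node. */
static unsigned int model_next_node_in(unsigned int prev)
{
	return (prev + 1) % NR_NODES;
}

/* Models the quoted weighted_interleave_nodes(): returns the node used. */
unsigned int model_alloc(struct wi_state *s, const unsigned char *weights)
{
	unsigned int next = model_next_node_in(s->il_prev);

	if (!s->cur_weight)
		s->cur_weight = weights[next];
	assert(s->cur_weight > 0);	/* this model assumes non-zero weights */

	s->cur_weight--;
	if (!s->cur_weight)
		s->il_prev = next;	/* advance only when the weight is used up */
	return next;
}

/* Node reported by the do_get_mempolicy() hunk as quoted above. */
unsigned int report_quoted(const struct wi_state *s)
{
	return s->cur_weight ? s->il_prev : model_next_node_in(s->il_prev);
}

/* Node reported if next_node_in() is always used. */
unsigned int report_suggested(const struct wi_state *s)
{
	return model_next_node_in(s->il_prev);
}
```

In this model, with weights (2,1) and il_prev starting at NR_NODES - 1, the
state after the first allocation has cur_weight == 1 and il_prev unchanged.
report_quoted() then returns node 1, while the next model_alloc() call
returns node 0, which is what report_suggested() returns.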

> } else {
> err = -EINVAL;
> goto out;
> @@ -1785,7 +1806,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> pol = __get_vma_policy(vma, addr, ilx);
> if (!pol)
> pol = get_task_policy(current);
> - if (pol->mode == MPOL_INTERLEAVE) {
> + if (pol->mode == MPOL_INTERLEAVE ||
> + pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> *ilx += vma->vm_pgoff >> order;
> *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> }
> @@ -1835,6 +1857,28 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> return zone >= dynamic_policy_zone;
> }
>
> +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> +{
> + unsigned int next;
> + struct task_struct *me = current;
> + u8 __rcu *table;
> +
> + next = next_node_in(me->il_prev, policy->nodes);
> + if (next == MAX_NUMNODES)
> + return next;
> +
> + rcu_read_lock();
> + table = rcu_dereference(iw_table);
> + if (!policy->wil.cur_weight)
> + policy->wil.cur_weight = table ? table[next] : 1;
> + rcu_read_unlock();
> +
> + policy->wil.cur_weight--;
> + if (!policy->wil.cur_weight)
> + me->il_prev = next;
> + return next;
> +}
> +

[snip]

--
Best Regards,
Huang, Ying