Re: [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes

From: Peter Zijlstra
Date: Mon Feb 14 2022 - 10:06:04 EST


On Mon, Feb 14, 2022 at 08:15:52PM +0800, Huang Ying wrote:

> This isn't a practical problem now yet. Because the PMEM nodes (node
> 2 and node 3 in example system) are offlined by default during system
> boot. So init_numa_topology_type() called during system boot will
> ignore them and set sched_numa_topology_type to NUMA_DIRECT. And
> init_numa_topology_type() is only called at runtime when a CPU of a
> never-onlined-before node gets plugged in. And there's no CPU in the
> PMEM nodes. But it appears better to fix this to make the code more
> robust.

IIRC there are pre-existing issues with this; namely the distance_map is
created for all nodes, online or not, therefore the levels and
max_distance include the pmem stuff.

At the same time, the numa_topolog_type() uses those values, and the
only reason it 'worked' is because the combination of arguments fails to
hit any of the existing types and exits without setting a type,
defaulting to NUMA_DIRECT by 'accident' of that being type 0 and
bss/data being 0 initialized.

Also, Power (and possibly other architectures) already have CPU-less
nodes and are similarly suffering issues.

Anyway, aside from this the patches look like they should do.

There's a few niggles, like using READ_ONCE() on sched_max_numa_distance
without using WRITE_ONCE() (see below) and having
sched_domains_numa_distance and sched_domains_numa_masks separate RCU
variables (that could go side-ways if there were a function using both,
afaict there isn't and I couldn't be bothered changing that, but it's
something to keep in mind).

I'll go queue these, thanks!

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1259,11 +1259,10 @@ static bool numa_is_active_node(int nid,

/* Handle placement on systems where not all nodes are directly connected. */
static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
- int maxdist, bool task)
+ int lim_dist, bool task)
{
unsigned long score = 0;
- int node;
- int sys_max_dist;
+ int node, max_dist;

/*
* All nodes are directly connected, and the same distance
@@ -1273,7 +1272,7 @@ static unsigned long score_nearby_nodes(
return 0;

/* sched_max_numa_distance may be changed in parallel. */
- sys_max_dist = READ_ONCE(sched_max_numa_distance);
+ max_dist = READ_ONCE(sched_max_numa_distance);
/*
* This code is called for each node, introducing N^2 complexity,
* which should be ok given the number of nodes rarely exceeds 8.
@@ -1286,7 +1285,7 @@ static unsigned long score_nearby_nodes(
* The furthest away nodes in the system are not interesting
* for placement; nid was already counted.
*/
- if (dist >= sys_max_dist || node == nid)
+ if (dist >= max_dist || node == nid)
continue;

/*
@@ -1296,8 +1295,7 @@ static unsigned long score_nearby_nodes(
* "hoplimit", only nodes closer by than "hoplimit" are part
* of each group. Skip other nodes.
*/
- if (sched_numa_topology_type == NUMA_BACKPLANE &&
- dist >= maxdist)
+ if (sched_numa_topology_type == NUMA_BACKPLANE && dist >= lim_dist)
continue;

/* Add up the faults from nearby nodes. */
@@ -1315,8 +1313,8 @@ static unsigned long score_nearby_nodes(
* This seems to result in good task placement.
*/
if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
- faults *= (sys_max_dist - dist);
- faults /= (sys_max_dist - LOCAL_DISTANCE);
+ faults *= (max_dist - dist);
+ faults /= (max_dist - LOCAL_DISTANCE);
}

score += faults;
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1927,7 +1927,7 @@ void sched_init_numa(int offline_node)
sched_domain_topology = tl;

sched_domains_numa_levels = nr_levels;
- sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
+ WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);

init_numa_topology_type(offline_node);
}