Re: [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes

From: Huang, Ying
Date: Mon Feb 14 2022 - 20:29:10 EST


Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:

> On Mon, Feb 14, 2022 at 08:15:52PM +0800, Huang Ying wrote:
>
>> This isn't a practical problem now yet. Because the PMEM nodes (node
>> 2 and node 3 in example system) are offlined by default during system
>> boot. So init_numa_topology_type() called during system boot will
>> ignore them and set sched_numa_topology_type to NUMA_DIRECT. And
>> init_numa_topology_type() is only called at runtime when a CPU of a
>> never-onlined-before node gets plugged in. And there's no CPU in the
>> PMEM nodes. But it appears better to fix this to make the code more
>> robust.
>
> IIRC there are pre-existing issues with this; namely the distance_map is
> created for all nodes, online or not, therefore the levels and
> max_distance include the pmem stuff.
>
> At the same time, the numa_topolog_type() uses those values, and the
> only reason it 'worked' is because the combination of arguments fails to
> hit any of the existing types and exits without setting a type,
> defaulting to NUMA_DIRECT by 'accident' of that being type 0 and
> bss/data being 0 initialized.
>
> Also, Power (and possibly other architectures) already have CPU-less
> nodes and are similarly suffering issues.
>
> Anyway, aside from this the patches look like they should do.
>
> There's a few niggles, like using READ_ONCE() on sched_max_numa_distance
> without using WRITE_ONCE() (see below) and having
> sched_domains_numa_distance and sched_domains_numa_masks separate RCU
> variables (that could go side-ways if there were a function using both,
> afaict there isn't and I couldn't be bothered changing that, but it's
> something to keep in mind).

If you think that's necessary, I can work on it with a follow up patch.

> I'll go queue these, thanks!

Thanks!

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1259,11 +1259,10 @@ static bool numa_is_active_node(int nid,
>
> /* Handle placement on systems where not all nodes are directly connected. */
> static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
> - int maxdist, bool task)
> + int lim_dist, bool task)
> {
> unsigned long score = 0;
> - int node;
> - int sys_max_dist;
> + int node, max_dist;
>
> /*
> * All nodes are directly connected, and the same distance
> @@ -1273,7 +1272,7 @@ static unsigned long score_nearby_nodes(
> return 0;
>
> /* sched_max_numa_distance may be changed in parallel. */
> - sys_max_dist = READ_ONCE(sched_max_numa_distance);
> + max_dist = READ_ONCE(sched_max_numa_distance);
> /*
> * This code is called for each node, introducing N^2 complexity,
> * which should be ok given the number of nodes rarely exceeds 8.
> @@ -1286,7 +1285,7 @@ static unsigned long score_nearby_nodes(
> * The furthest away nodes in the system are not interesting
> * for placement; nid was already counted.
> */
> - if (dist >= sys_max_dist || node == nid)
> + if (dist >= max_dist || node == nid)
> continue;
>
> /*
> @@ -1296,8 +1295,7 @@ static unsigned long score_nearby_nodes(
> * "hoplimit", only nodes closer by than "hoplimit" are part
> * of each group. Skip other nodes.
> */
> - if (sched_numa_topology_type == NUMA_BACKPLANE &&
> - dist >= maxdist)
> + if (sched_numa_topology_type == NUMA_BACKPLANE && dist >= lim_dist)
> continue;
>
> /* Add up the faults from nearby nodes. */
> @@ -1315,8 +1313,8 @@ static unsigned long score_nearby_nodes(
> * This seems to result in good task placement.
> */
> if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
> - faults *= (sys_max_dist - dist);
> - faults /= (sys_max_dist - LOCAL_DISTANCE);
> + faults *= (max_dist - dist);
> + faults /= (max_dist - LOCAL_DISTANCE);
> }
>
> score += faults;
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1927,7 +1927,7 @@ void sched_init_numa(int offline_node)
> sched_domain_topology = tl;
>
> sched_domains_numa_levels = nr_levels;
> - sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
> + WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
>
> init_numa_topology_type(offline_node);
> }

Sorry, 0-Day building robot reported another small issue, can you fold
the following patch too?

Best Regards,
Huang, Ying

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 490acd4b835d..d72128151e56 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1947,7 +1947,7 @@ void sched_init_numa(int offline_node)
sched_max_numa_distance);
}

-void sched_reset_numa(void)
+static void sched_reset_numa(void)
{
int nr_levels, *distances;
struct cpumask ***masks;
--
2.30.2