Re: [PATCH v2] pseries: prevent free CPU ids to be reused on another node

From: Nathan Lynch
Date: Fri Apr 02 2021 - 09:34:31 EST


Hi Laurent,

Laurent Dufour <ldufour@xxxxxxxxxxxxx> writes:
> When a CPU is hot added, the CPU ids are taken from the available mask from
> the lower possible set. If that set of values was previously used for CPU
> attached to a different node, this seems to application like if these CPUs
> have migrated from a node to another one which is not expected in real
> life.

This seems like a problem that could affect other architectures or
platforms? I guess as long as arch code is responsible for placing new
CPUs in cpu_present_mask, that code will have the responsibility of
ensuring CPU IDs' NUMA assignments remain stable.

[...]

> The effect of this patch can be seen by removing and adding CPUs using the
> Qemu monitor. In the following case, the first CPU from the node 2 is
> removed, then the first one from the node 1 is removed too. Later, the
> first CPU of the node 2 is added back. Without that patch, the kernel will
> numbered these CPUs using the first CPU ids available which are the ones
> freed when removing the second CPU of the node 0. This leads to the CPU ids
> 16-23 to move from the node 1 to the node 2. With the patch applied, the
> CPU ids 32-39 are used since they are the lowest free ones which have not
> been used on another node.
>
> At boot time:
> [root@vm40 ~]# numactl -H | grep cpus
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>
> Vanilla kernel, after the CPU hot unplug/plug operations:
> [root@vm40 ~]# numactl -H | grep cpus
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 1 cpus: 24 25 26 27 28 29 30 31
> node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47
>
> Patched kernel, after the CPU hot unplug/plug operations:
> [root@vm40 ~]# numactl -H | grep cpus
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 1 cpus: 24 25 26 27 28 29 30 31
> node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Good demonstration of the problem. CPUs 16-23 "move" from node 1 to
node 2.


> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 12cbffd3c2e3..48c7943b25b0 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -39,6 +39,8 @@
> /* This version can't take the spinlock, because it never returns */
> static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;
>
> +static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];

I guess this should have documentation that it must be
accessed/manipulated with cpu_add_remove_lock held?


> +
> static void rtas_stop_self(void)
> {
> static struct rtas_args args;
> @@ -151,29 +153,61 @@ static void pseries_cpu_die(unsigned int cpu)
> */
> static int pseries_add_processor(struct device_node *np)
> {
> - unsigned int cpu;
> + unsigned int cpu, node;
> cpumask_var_t candidate_mask, tmp;
> - int err = -ENOSPC, len, nthreads, i;
> + int err = -ENOSPC, len, nthreads, i, nid;

>From eight local vars to ten, and the two new variables' names are
"node" and "nid". More distinctive names would help readers.


> const __be32 *intserv;
> + bool force_reusing = false;
>
> intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);
> if (!intserv)
> return 0;
>
> - zalloc_cpumask_var(&candidate_mask, GFP_KERNEL);
> - zalloc_cpumask_var(&tmp, GFP_KERNEL);
> + alloc_cpumask_var(&candidate_mask, GFP_KERNEL);
> + alloc_cpumask_var(&tmp, GFP_KERNEL);
> +
> + /*
> + * Fetch from the DT nodes read by dlpar_configure_connector() the NUMA
> + * node id the added CPU belongs to.
> + */
> + nid = of_node_to_nid(np);
> + if (nid < 0 || !node_possible(nid))
> + nid = first_online_node;
>
> nthreads = len / sizeof(u32);
> - for (i = 0; i < nthreads; i++)
> - cpumask_set_cpu(i, tmp);
>
> cpu_maps_update_begin();
>
> BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));
>
> +again:
> + cpumask_clear(candidate_mask);
> + cpumask_clear(tmp);
> + for (i = 0; i < nthreads; i++)
> + cpumask_set_cpu(i, tmp);
> +
> /* Get a bitmap of unoccupied slots. */
> cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);
> +
> + /*
> + * Remove free ids previously assigned on the other nodes. We can walk
> + * only online nodes because once a node became online it is not turned
> + * offlined back.
> + */
> + if (!force_reusing)
> + for_each_online_node(node) {
> + if (node == nid) /* Keep our node's recorded ids */
> + continue;
> + cpumask_andnot(candidate_mask, candidate_mask,
> + node_recorded_ids_map[node]);
> + }
> +
> if (cpumask_empty(candidate_mask)) {
> + if (!force_reusing) {
> + force_reusing = true;
> + goto again;
> + }
> +

Hmm I'd encourage you to work toward a solution that doesn't involve
adding backwards jumps and a bool flag to this code.

The function already mixes concerns and this change makes it a bit more
difficult to follow. I'd suggest that you first factor out into a
separate function the parts that allocate a suitable range from
cpu_possible_mask, and only then introduce the behavior change
constraining the results.


> /* If we get here, it most likely means that NR_CPUS is
> * less than the partition's max processors setting.
> */
> @@ -191,12 +225,36 @@ static int pseries_add_processor(struct device_node *np)
> cpumask_shift_left(tmp, tmp, nthreads);
>
> if (cpumask_empty(tmp)) {
> + if (!force_reusing) {
> + force_reusing = true;
> + goto again;
> + }
> printk(KERN_ERR "Unable to find space in cpu_present_mask for"
> " processor %pOFn with %d thread(s)\n", np,
> nthreads);
> goto out_unlock;
> }
>
> + /* Record the newly used CPU ids for the associate node. */
> + cpumask_or(node_recorded_ids_map[nid], node_recorded_ids_map[nid], tmp);
> +
> + /*
> + * If we force reusing the id, remove these ids from any node which was
> + * previously using it.
> + */
> + if (force_reusing) {
> + cpu = cpumask_first(tmp);
> + pr_warn("Reusing free CPU ids %d-%d from another node\n",
> + cpu, cpu + nthreads - 1);
> +
> + for_each_online_node(node) {
> + if (node == nid)
> + continue;
> + cpumask_andnot(node_recorded_ids_map[node],
> + node_recorded_ids_map[node], tmp);
> + }
> + }
> +

I don't know, should we not fail the request instead of doing the
ABI-breaking thing the code in this change is trying to prevent? I don't
think a warning in the kernel log is going to help any application that
would be affected by this.


> for_each_cpu(cpu, tmp) {
> BUG_ON(cpu_present(cpu));
> set_cpu_present(cpu, true);
> @@ -889,6 +947,7 @@ static struct notifier_block pseries_smp_nb = {
> static int __init pseries_cpu_hotplug_init(void)
> {
> int qcss_tok;
> + unsigned int node;
>
> #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
> ppc_md.cpu_probe = dlpar_cpu_probe;
> @@ -910,8 +969,18 @@ static int __init pseries_cpu_hotplug_init(void)
> smp_ops->cpu_die = pseries_cpu_die;
>
> /* Processors can be added/removed only on LPAR */
> - if (firmware_has_feature(FW_FEATURE_LPAR))
> + if (firmware_has_feature(FW_FEATURE_LPAR)) {
> + for_each_node(node) {
> + alloc_bootmem_cpumask_var(&node_recorded_ids_map[node]);
> +
> + /* Record ids of CPU added at boot time */
> + cpumask_or(node_recorded_ids_map[node],
> + node_recorded_ids_map[node],
> + node_to_cpumask_map[node]);
> + }
> +
> of_reconfig_notifier_register(&pseries_smp_nb);
> + }

This looks OK.