Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API

From: Tariq Toukan
Date: Tue Aug 09 2022 - 06:18:44 EST

On 8/9/2022 1:02 PM, Valentin Schneider wrote:
> On 08/08/22 17:39, Tariq Toukan wrote:
>> On 8/4/2022 8:28 PM, Valentin Schneider wrote:
>>> On 28/07/22 22:12, Tariq Toukan wrote:
>>>> +static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
>>>                                                       ^^^^^^^^^
>>> That should be a struct *cpumask.
>>>
>> With cpumask, we'll lose the order.
>>
>
> Right, I didn't get that from the changelog.
>

I'll make it clear in the re-spin.
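
For illustration (a hypothetical demo, not part of the patch): a cpumask
is a bitmap, so no matter in which order CPUs are added, iteration always
walks them in ascending CPU id, and any distance-based ordering is lost.
That is why the API returns an ordered u16 array instead:

static void cpumask_order_demo(void)
{
	cpumask_var_t mask;
	int cpu;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return;

	cpumask_set_cpu(8, mask);	/* closest CPU, added first */
	cpumask_set_cpu(2, mask);	/* farther CPU, added second */

	for_each_cpu(cpu, mask)		/* visits 2, then 8 */
		pr_info("cpu %d\n", cpu);

	free_cpumask_var(mask);
}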



>>>> +{
>>>> +	cpumask_var_t cpumask;
>>>> +	int first, i;
>>>> +
>>>> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
>>>> +		return false;
>>>> +
>>>> +	cpumask_copy(cpumask, cpu_online_mask);
>>>> +
>>>> +	first = cpumask_first(cpumask_of_node(node));
>>>> +
>>>> +	for (i = 0; i < ncpus; i++) {
>>>> +		int cpu;
>>>> +
>>>> +		cpu = sched_numa_find_closest(cpumask, first);
>>>> +		if (cpu >= nr_cpu_ids) {
>>>> +			free_cpumask_var(cpumask);
>>>> +			return false;
>>>> +		}
>>>> +		cpus[i] = cpu;
>>>> +		__cpumask_clear_cpu(cpu, cpumask);
>>>> +	}
>>>> +
>>>> +	free_cpumask_var(cpumask);
>>>> +	return true;
>>>> +}
>>>
>>> This will fail if ncpus > nr_cpu_ids, which shouldn't be a problem. It
>>> would make more sense to set *up to* ncpus; the calling code can then
>>> decide if getting fewer than requested is OK or not.
>>>
>>> I also don't get the fallback to cpumask_local_spread(), is that if the
>>> NUMA topology hasn't been initialized yet? It feels like most users of this
>>> would invoke it late enough (i.e. anything after early initcalls) to have
>>> the backing data available.
>>
>> I don't expect this to fail, as we invoke it late enough. The fallback is
>> there just in case, to preserve the old behavior instead of breaking
>> entirely.
>>
>
> Then there shouldn't be a fallback method - the main method is expected to
> work.


I'll drop the fallback then.
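
For the record, a minimal sketch of how the helper could look with both
points applied - returning *up to* ncpus, and no fallback (hypothetical,
not the actual re-spin):

static int sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
{
	cpumask_var_t cpumask;
	int first, i;

	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
		return 0;

	cpumask_copy(cpumask, cpu_online_mask);
	first = cpumask_first(cpumask_of_node(node));

	for (i = 0; i < ncpus; i++) {
		int cpu = sched_numa_find_closest(cpumask, first);

		if (cpu >= nr_cpu_ids)
			break;	/* ran out of CPUs: return what we have */
		cpus[i] = cpu;
		__cpumask_clear_cpu(cpu, cpumask);
	}

	free_cpumask_var(cpumask);
	return i;	/* entries filled, possibly fewer than ncpus */
}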


>>> Finally, I think iterating only once per NUMA level would make more sense.
>>
>> Agree, although it's just a setup stage.
>> I'll check if it can work for me, based on the reference code below.
>>
>>> I've scribbled something together from those thoughts, see below. This has
>>> just the mlx5 bits touched to show what I mean, but that's just compile
>>> tested.
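
(Since the scribble itself is not quoted in this mail, here is a rough,
hypothetical sketch of the "iterate once per NUMA level" idea, using
node_distance() rather than Valentin's actual code: visit nodes in
increasing distance from @node and append their online CPUs, so each
node is scanned once instead of running a closest-CPU search per CPU.)

static int cpus_spread_by_node_distance(int node, u16 *cpus, int ncpus)
{
	int filled = 0;
	int nr = 0, n, i;
	int *order;

	/* nodes sorted by distance from @node, nearest first */
	order = kcalloc(nr_node_ids, sizeof(*order), GFP_KERNEL);
	if (!order)
		return 0;

	for_each_online_node(n)
		order[nr++] = n;

	/* insertion sort by distance; nr is small */
	for (i = 1; i < nr; i++) {
		int key = order[i], j = i - 1;

		while (j >= 0 && node_distance(node, order[j]) >
				 node_distance(node, key)) {
			order[j + 1] = order[j];
			j--;
		}
		order[j + 1] = key;
	}

	for (i = 0; i < nr && filled < ncpus; i++) {
		int cpu;

		for_each_cpu_and(cpu, cpumask_of_node(order[i]),
				 cpu_online_mask) {
			cpus[filled++] = cpu;
			if (filled == ncpus)
				break;
		}
	}

	kfree(order);
	return filled;
}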

>> My function returns a *sorted* list of the N closest cpus.
>> That is important. In many cases, drivers do not need all N irqs, but
>> only a portion of them, so they want to use the closest subset of cpus.
>>
>
> Are there cases where we can't figure this out in advance? From what I grok
> out of the two callsites you patched, all vectors will be used unless some
> error happens, so compressing the CPUs in a single cpumask seemed
> sufficient.
>

All vectors will be initialized to support the maximum number of traffic
rings. However, the actual number of traffic rings can be controlled and
set to a lower number N_actual < N. In this case, we'll be using only
N_actual instances, and we want them to be the first/closest.
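
Concretely (a hypothetical consumer, not from the series): because cpus[]
is sorted nearest-first, a driver that enables only n_actual of its n
vectors can simply take the array prefix and still get the closest CPUs:

static int set_ring_affinity_hints(int *irqs, u16 *cpus, int n_actual)
{
	int i, err;

	for (i = 0; i < n_actual; i++) {
		/* prefix of the sorted array = closest CPUs */
		err = irq_set_affinity_hint(irqs[i], cpumask_of(cpus[i]));
		if (err)
			return err;
	}
	return 0;
}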