Re: [PATCH V5 net-next] net: mana: Assigning IRQ affinity on HT cores

From: Souradeep Chakrabarti
Date: Tue Dec 12 2023 - 06:39:05 EST


On Mon, Dec 11, 2023 at 07:30:46AM -0800, Yury Norov wrote:
> On Sun, Dec 10, 2023 at 10:37:26PM -0800, Souradeep Chakrabarti wrote:
> > On Fri, Dec 08, 2023 at 06:03:39AM -0800, Yury Norov wrote:
> > > On Fri, Dec 08, 2023 at 02:02:34AM -0800, Souradeep Chakrabarti wrote:
> > > > Existing MANA design assigns IRQ to every CPU, including sibling
> > > > hyper-threads. This may cause multiple IRQs to be active simultaneously
> > > > in the same core and may reduce the network performance with RSS.
> > >
> > > Can you add an IRQ distribution diagram to compare before/after
> > > behavior, similarly to what I did in the other email?
> > >
> > Let's consider this topology:
>
> Not here - in commit message, please.
Okay, will do that.
>
> >
> > Node 0 1
> > Core 0 1 2 3
> > CPU 0 1 2 3 4 5 6 7
> >
> > Before
> > IRQ Nodes Cores CPUs
> > 0 1 0 0
> > 1 1 1 2
> > 2 1 0 1
> > 3 1 1 3
> > 4 2 2 4
> > 5 2 3 6
> > 6 2 2 5
> > 7 2 3 7
> >
> > Now
> > IRQ Nodes Cores CPUs
> > 0 1 0 0-1
> > 1 1 1 2-3
> > 2 1 0 0-1
> > 3 1 1 2-3
> > 4 2 2 4-5
> > 5 2 3 6-7
> > 6 2 2 4-5
> > 7 2 3 6-7
>
> If you decided to take my wording, please give credits.
>
Will take care of that :).
> > > > Improve the performance by assigning IRQ to non sibling CPUs in local
> > > > NUMA node. The performance improvement we are getting using ntttcp with
> > > > following patch is around 15 percent with existing design and approximately
> > > > 11 percent, when trying to assign one IRQ in each core across NUMA nodes,
> > > > if enough cores are present.
> > >
> > > How did you measure it? In the other email you said you used perf, can
> > > you show your procedure in details?
> > I have used ntttcp for performance analysis, by perf I had meant performance
> > analysis. I have used ntttcp with following parameters
> > ntttcp -r -m 64 <receiver>
> >
> > ntttcp -s <receiver side ip address> -m 64 <sender>
> > Both the VMs are in same Azure subnet and private IP address is used.
> > MTU and tcp buffer is set accordingly and number of channels are set using ethtool
> > accordingly for best performance. Also irqbalance was disabled.
> > https://github.com/microsoft/ntttcp-for-linux
> > https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-bandwidth-testing?tabs=linux
>
> OK. Can you also print the before/after outputs of ntttcp that demonstrate
> +15%?
>
With affinity spread like each core only 1 irq and spreading accross multiple NUMA node>
8 ./ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:05:20 INFO: 17 threads created
08:05:28 INFO: Network activity progressing...
08:06:28 INFO: Test run completed.
08:06:28 INFO: Test cycle finished.
08:06:28 INFO: ##### Totals: #####
08:06:28 INFO: test duration :60.00 seconds
08:06:28 INFO: total bytes :630292053310
08:06:28 INFO: throughput :84.04Gbps
08:06:28 INFO: retrans segs :4
08:06:28 INFO: cpu cores :192
08:06:28 INFO: cpu speed :3799.725MHz
08:06:28 INFO: user :0.05%
08:06:28 INFO: system :1.60%
08:06:28 INFO: idle :96.41%
08:06:28 INFO: iowait :0.00%
08:06:28 INFO: softirq :1.94%
08:06:28 INFO: cycles/byte :2.50
08:06:28 INFO: cpu busy (all) :534.41%

With our new proposal

./ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:08:51 INFO: 17 threads created
08:08:56 INFO: Network activity progressing...
08:09:56 INFO: Test run completed.
08:09:56 INFO: Test cycle finished.
08:09:56 INFO: ##### Totals: #####
08:09:56 INFO: test duration :60.00 seconds
08:09:56 INFO: total bytes :741966608384
08:09:56 INFO: throughput :98.93Gbps
08:09:56 INFO: retrans segs :6
08:09:56 INFO: cpu cores :192
08:09:56 INFO: cpu speed :3799.791MHz
08:09:56 INFO: user :0.06%
08:09:56 INFO: system :1.81%
08:09:56 INFO: idle :96.18%
08:09:56 INFO: iowait :0.00%
08:09:56 INFO: softirq :1.95%
08:09:56 INFO: cycles/byte :2.25
08:09:56 INFO: cpu busy (all) :569.22%
---------------------------------------------------------

> > > > Suggested-by: Yury Norov <yury.norov@xxxxxxxxx>
> > > > Signed-off-by: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> > > > ---
> > >
> > > [...]
> > >
> > > > .../net/ethernet/microsoft/mana/gdma_main.c | 92 +++++++++++++++++--
> > > > 1 file changed, 83 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > index 6367de0c2c2e..18e8908c5d29 100644
> > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > @@ -1243,15 +1243,56 @@ void mana_gd_free_res_map(struct gdma_resource *r)
> > > > r->size = 0;
> > > > }
> > > >
> > > > +static int irq_setup(int *irqs, int nvec, int start_numa_node)
>
> Is it intentional that irqs and nvec are signed? If not, please make
> them unsigned.
Will do it in next version.
>
> > > > +{
> > > > + int w, cnt, cpu, err = 0, i = 0;
> > > > + int next_node = start_numa_node;
> > >
> > > What for this?
> > This is the local numa node, from where to start hopping.
> > Please see how we are calling irq_setup(). We are passing the array of allocated irqs, total
> > number of irqs allocated, and the local numa node to the device.
>
> I'll ask again: you copy parameter (start_numa_node) to a local
> variable (next_node), and never use start_numa_node after that.
>
> You can just use the parameter, and avoid creating local variable at
> all, so what for the latter exist?
>
> The naming is confusing. I think just 'node' is OK here.
Thanks, I wll not use the extra variable next_node.
>
> > > > + const struct cpumask *next, *prev = cpu_none_mask;
> > > > + cpumask_var_t curr, cpus;
> > > > +
> > > > + if (!zalloc_cpumask_var(&curr, GFP_KERNEL)) {
> > > > + err = -ENOMEM;
> > > > + return err;
> > > > + }
> > > > + if (!zalloc_cpumask_var(&cpus, GFP_KERNEL)) {
> > >
> > > free(curr);
> > Will fix it in next version. Thanks for pointing.
>
> And also drop 'err' - just 'return -ENOMEM'.
>
Will fix it in next revision.
> > >
> > > > + err = -ENOMEM;
> > > > + return err;
> > > > + }
> > > > +
> > > > + rcu_read_lock();
> > > > + for_each_numa_hop_mask(next, next_node) {
> > > > + cpumask_andnot(curr, next, prev);
> > > > + for (w = cpumask_weight(curr), cnt = 0; cnt < w; ) {
> > > > + cpumask_copy(cpus, curr);
> > > > + for_each_cpu(cpu, cpus) {
> > > > + irq_set_affinity_and_hint(irqs[i], topology_sibling_cpumask(cpu));
> > > > + if (++i == nvec)
> > > > + goto done;
> > >
> > > Think what if you're passed with irq_setup(NULL, 0, 0).
> > > That's why I suggested to place this check at the beginning.
> > >
> > irq_setup() is a helper function for mana_gd_setup_irqs(), which already takes
> > care of no NULL pointer for irqs, and 0 number of interrupts can not be passed.
> >
> > nvec = pci_alloc_irq_vectors(pdev, 2, max_irqs, PCI_IRQ_MSIX);
> > if (nvec < 0)
> > return nvec;
>
> I know that. But still it's a bug. The common convention is that if a
> 0-length array is passed to a function, it should not dereference the
> pointer.
>
I will add one if check in the begining of irq_setup() to verify the pointer
and the nvec number.
> ...
>
> > > > @@ -1287,21 +1336,44 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
> > > > goto free_irq;
> > > > }
> > > >
> > > > - err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
> > > > - if (err)
> > > > - goto free_irq;
> > > > -
> > > > - cpu = cpumask_local_spread(i, gc->numa_node);
> > > > - irq_set_affinity_and_hint(irq, cpumask_of(cpu));
> > > > + if (!i) {
> > > > + err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
> > > > + if (err)
> > > > + goto free_irq;
> > > > +
> > > > + /* If number of IRQ is one extra than number of online CPUs,
> > > > + * then we need to assign IRQ0 (hwc irq) and IRQ1 to
> > > > + * same CPU.
> > > > + * Else we will use different CPUs for IRQ0 and IRQ1.
> > > > + * Also we are using cpumask_local_spread instead of
> > > > + * cpumask_first for the node, because the node can be
> > > > + * mem only.
> > > > + */
> > > > + if (start_irq_index) {
> > > > + cpu = cpumask_local_spread(i, gc->numa_node);
> > >
> > > I already mentioned that: if i == 0, you don't need to spread, just
> > > pick 1st cpu from node.
> > The reason I have picked cpumask_local_spread here, is that, the gc->numa_node
> > can be a memory only node, in that case we need to jump to next node to get the CPU.
> > Which cpumask_local_spread() using sched_numa_find_nth_cpu() takes care off.
>
> OK, makes sense.
>
> What if you need to distribute more IRQs than the number of CPUs? In
> that case you'd call the function many times. But because you return
> 0, user has no chance catch that. I think you should handle it inside
> the helper, or do like this:
>
> while (nvec) {
> distributed = irq_setup(irqs, nvec, node);
> if (distributed < 0)
> break;
>
> irq += distributed;
> nvec -= distributed;
> }
We can not have irqs more greater than 1 of num of online CPUs, as we are
setting it inside cpu_read_lock() with num_online_cpus().
We can have minimum 2 IRQs and max number_online_cpus() +1 or 64, which is
maximum supported IRQs per port.

1295 cpus_read_lock();
1296 max_queues_per_port = num_online_cpus();
1297 if (max_queues_per_port > MANA_MAX_NUM_QUEUES)
1298 max_queues_per_port = MANA_MAX_NUM_QUEUES;
1299
1300 /* Need 1 interrupt for the Hardware communication Channel (HWC) */
1301 max_irqs = max_queues_per_port + 1;
1302
1303 nvec = pci_alloc_irq_vectors(pdev, 2, max_irqs, PCI_IRQ_MSIX);
1304 if (nvec < 0)
1305 return nvec;
>
> Thanks,
> Yury