Re: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

From: Souradeep Chakrabarti
Date: Tue Jan 16 2024 - 01:14:01 EST

On Sat, Jan 13, 2024 at 11:11:50AM -0800, Yury Norov wrote:
> On Sat, Jan 13, 2024 at 04:20:31PM +0000, Michael Kelley wrote:
> > From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx> Sent: Friday, January 12, 2024 10:31 PM
> >
> > > On Fri, Jan 12, 2024 at 06:30:44PM +0000, Haiyang Zhang wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Michael Kelley <mhklinux@xxxxxxxxxxx> Sent: Friday, January 12, 2024 11:37 AM
> > > > >
> > > > > From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx> Sent:
> > > > > Wednesday, January 10, 2024 10:13 PM
> > > > > >
> > > > > > > The test topology used to compare the performance of
> > > > > > > cpu_local_spread() and the new approach is:
> > > > > > Case 1
> > > > > > > IRQ  Nodes  Cores  CPUs
> > > > > > > 0    1      0      0-1
> > > > > > > 1    1      1      2-3
> > > > > > > 2    1      2      4-5
> > > > > > > 3    1      3      6-7
> > > > > >
> > > > > > and with existing cpu_local_spread()
> > > > > > Case 2
> > > > > > > IRQ  Nodes  Cores  CPUs
> > > > > > > 0    1      0      0
> > > > > > > 1    1      0      1
> > > > > > > 2    1      1      2
> > > > > > > 3    1      1      3
> > > > > >
> > > > > > Total 4 channels were used, which was set up by ethtool.
> > > > > > > Case 1 with ntttcp gave 15 percent better performance than
> > > > > > > case 2. irqbalance was also disabled during the test.
> > > > > >
> > > > > > > Also, you are right: on a 64-CPU system this approach will spread
> > > > > > > the IRQs like cpu_local_spread(), but in the future we will offer
> > > > > > > MANA nodes with more than 64 CPUs. There, this new design will
> > > > > > > give better performance.
> > > > > >
> > > > > > > I will add these performance benefit details to the commit message
> > > > > > > of the next version.
> > > > >
> > > > > Here are my concerns:
> > > > >
> > > > > 1. The most commonly used VMs these days have 64 or fewer
> > > > > vCPUs and won't see any performance benefit.
> > > > >
> > > > > 2. Larger VMs probably won't see the full 15% benefit because
> > > > > all vCPUs in the local NUMA node will be assigned IRQs. For
> > > > > example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
> > > > > vCPUs in NUMA node 0 will all be assigned IRQs. The remaining
> > > > > 16 IRQs will be spread out on the 48 CPUs in NUMA node 1
> > > > > > in a way that avoids sharing a core. But overall this means
> > > > > that 75% of the IRQs will still be sharing a core and
> > > > > presumably not see any perf benefit.
> > > > >
> > > > > 3. Your experiment was on a relatively small scale: 4 IRQs
> > > > > spread across 2 cores vs. across 4 cores. Have you run any
> > > > > experiments on VMs with 128 vCPUs (for example) where
> > > > > most of the IRQs are not sharing a core? I'm wondering if
> > > > > the results with 4 IRQs really scale up to 64 IRQs. A lot can
> > > > > be different in a VM with 64 cores and 2 NUMA nodes vs.
> > > > > 4 cores in a single node.
> > > > >
> > > > > 4. The new algorithm prefers assigning to all vCPUs in
> > > > > each NUMA hop over assigning to separate cores. Are there
> > > > > experiments showing that is the right tradeoff? What
> > > > > are the results if assigning to separate cores is preferred?
> > > >
> > > > I remember in a customer case, putting the IRQs on the same
> > > > NUMA node has better perf. But I agree, this should be re-tested
> > > > on MANA nic.
> > >
> > > 1) and 2) The change will not decrease existing performance, and
> > > systems with a high number of CPUs will benefit from it.
> > >
> > > 3) The result has shown around 6 percent improvement.
> > >
> > > 4) The test result has shown around 10 percent difference when IRQs are
> > > spread on multiple NUMA nodes.
> >
> > OK, this looks pretty good. Make clear in the commit messages what
> > the tradeoffs are, and what the real-world benefits are expected to be.
> > Some future developer who wants to understand why IRQs are assigned
> > this way will thank you. :-)
>
> I agree with Michael, this needs to be spoken aloud.
>
> From the above, is it correct that the best performance is achieved
> when the # of IRQs is half the number of CPUs in the 1st node, because
> this configuration allows spreading IRQs across cores in the most
> optimal way? And if we have more or less than that, it hurts
> performance, at least for MANA networking?
It does not decrease performance relative to the current
cpu_local_spread(), but optimum performance comes when the node has
twice as many CPUs as IRQs (assuming SMT==2).

If the number of CPUs is the same as the number of IRQs (that is,
number of CPUs <= 64), we see the same performance as the existing
design with cpu_local_spread().

If the node has more than 64 CPUs, we get better performance than with
cpu_local_spread().
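
To illustrate the three cases above, here is a rough toy model in Python.
It is only a sketch and not the patch code. It assumes a single NUMA node,
SMT==2, and that CPUs 2k and 2k+1 are siblings of core k, as in the Case 1
topology quoted earlier. The function names are made up for illustration,
and the per-core variant pins each IRQ to one sibling instead of the whole
sibling mask.

```python
from collections import Counter


def spread_per_cpu(num_irqs, num_cpus):
    """Hypothetical stand-in for a per-CPU spread: IRQ i goes to CPU i, wrapping."""
    return [i % num_cpus for i in range(num_irqs)]


def spread_per_core(num_irqs, num_cpus, smt=2):
    """Hypothetical per-core spread: visit the first sibling of every core,
    then the second siblings, and wrap around if IRQs remain."""
    num_cores = num_cpus // smt
    order = [core * smt + sib for sib in range(smt) for core in range(num_cores)]
    return [order[i % len(order)] for i in range(num_irqs)]


def irqs_sharing_a_core(assignment, smt=2):
    """Count the IRQs that land on a core which also serves another IRQ."""
    per_core = Counter(cpu // smt for cpu in assignment)
    return sum(n for n in per_core.values() if n > 1)


if __name__ == "__main__":
    for irqs, cpus in [(4, 8), (64, 64), (64, 128)]:
        per_cpu = irqs_sharing_a_core(spread_per_cpu(irqs, cpus))
        per_core = irqs_sharing_a_core(spread_per_core(irqs, cpus))
        print(f"{irqs} IRQs on {cpus} CPUs: per-CPU shares {per_cpu}, "
              f"per-core shares {per_core}")
```

In this model, with 4 IRQs on 8 CPUs the per-CPU order leaves all 4 IRQs
sharing a core and the per-core order leaves none. With 64 IRQs on 64 CPUs
both orders end up with every IRQ sharing a core, and with 64 IRQs on 128
CPUs only the per-CPU order does.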
>
> So, the B|A performance chart may look like this, right?
>
> irq   nodes   cores   cpus         perf
> 0     1 | 1   0 | 0   0  | 0-1     0%
> 1     1 | 1   0 | 1   1  | 2-3     +5%
> 2     1 | 1   1 | 2   2  | 4-5     +10%
> 3     1 | 1   1 | 3   3  | 6-7     +15%
> 4     1 | 1   0 | 4   3  | 0-1     +12%
> ...     |       |        |
> 7     1 | 1   1 | 7   3  | 6-7     0%
> ...
> 15    2 | 2   3 | 3   15 | 14-15   0%
>
> Souradeep, can you please confirm that my understanding is correct?
>
> In v5, can you add a table like the above with real performance
> numbers for your driver? I think that it would help people to
> configure their VMs better when networking is a bottleneck.
>
I will share a chart in the next version of patch 3.
Thanks for the suggestion.
> Thanks,
> Yury