Re: [RFC] KVM: arm/arm64: optimize vSGI injection performance

From: zhaoxu
Date: Tue Aug 22 2023 - 23:19:59 EST




On 2023/8/22 16:28, Marc Zyngier wrote:
On Tue, 22 Aug 2023 04:51:30 +0100,
zhaoxu <zhaoxu.35@xxxxxxxxxxxxx> wrote:
In fact, the core vCPU search algorithm remains the same in the latest
kernel: iterate over all vCPUs and, if the MPIDR matches, inject. The
next version will be based on the latest kernel.

My point is that performance numbers on such an ancient kernel hardly
make any sense, as a large portion of the code will be different. We
aim to live in the future, not in the past.

Yes, I got it, thanks.

- which current guest OS *currently* make use of broadcast or 1:N
SGIs? Linux doesn't and overall SGI multicasting is pretty useless
to an OS.

[...]
Yes, arm64 Linux almost never sends broadcast IPIs. I will use different
test data to demonstrate the performance improvement.

Exactly. I also contend that *no* operating system uses broadcast (or
even multicast) signalling, because this is a very pointless
operation.

So what are you optimising for?

Explanation at the end.

/*
- * Compare a given affinity (level 1-3 and a level 0 mask, from the SGI
- * generation register ICC_SGI1R_EL1) with a given VCPU.
- * If the VCPU's MPIDR matches, return the level0 affinity, otherwise
- * return -1.
+ * Get affinity routing index from ICC_SGI_* register
+ * format:
+ * aff3 aff2 aff1 aff0
+ * |- 8 bits -|- 8 bits -|- 8 bits -|- 4 bits or 8bits -|

OK, so you are implementing RSS support:

- Why isn't that mentioned anywhere in the commit log?

- Given that KVM actively limits the MPIDR to 4 bits at Aff0, how does
it even work in the first place?

- How is that advertised to the guest?

- How can the guest enable RSS support?

Thanks for mentioning that. I also checked the relevant code: the guest
can't enable RSS, so this was my oversight. This part has been removed in
the next version.

Then what's the point of your patch? You don't explain anything, which
makes it very hard to guess what you're aiming for.

This patch aims to optimize the vCPU search algorithm used when injecting a vSGI.

For example, in a 64-core VM, the CPU topology consists of 4 aff0 groups (0-15, 16-31, 32-47, 48-63). When the guest wants to send an SGI to core 63, the previous logic requires KVM to iterate over all vCPUs with kvm_for_each_vcpu to identify core 63, and then inject the vSGI into it. However, the ICC_SGI* register provides affinity routing information, which lets us skip the first three aff0 groups and start with the last one. As a result, the number of iterations is reduced from the number of vCPUs (64 in this case) to 16 or 8 (using a mask to determine the distribution of the target list in the ICC_SGI* register).

The effect of this optimization is evident under the following conditions: 1. The VM has more than 16 cores. 2. The injection target vCPU is located after the 16th core. The patch must therefore also ensure that performance does not deteriorate when the injection target is in the first aff0 group (cores 0-15); that is why I included that test data in the patch.
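
To make the iteration counts above concrete, here is a minimal user-space sketch, not the patch itself. It assumes a toy topology in which vCPU i has Aff1 = i / 16 and Aff0 = i % 16, and all names (toy_mpidr, inject_linear, inject_grouped) are hypothetical. It only models the 16-iteration case, not the further reduction to 8 via the target-list mask.

```c
#include <assert.h>
#include <stdint.h>

#define NR_VCPUS     64  /* assumption: the 64-core VM from the example */
#define AFF0_PER_GRP 16  /* Aff0 limited to 4 bits -> 16 vCPUs per group */

/* Hypothetical toy model: vCPU i has Aff1 = i / 16 and Aff0 = i % 16. */
static uint32_t toy_mpidr(int vcpu_idx)
{
	return ((uint32_t)(vcpu_idx / AFF0_PER_GRP) << 8) |
	       (uint32_t)(vcpu_idx % AFF0_PER_GRP);
}

/*
 * Linear search: visit every vCPU and compare its MPIDR with the
 * requested Aff1 and the Aff0 target list. Returns the number of
 * vCPUs visited; *hit is a bitmap of the vCPUs that would get the SGI.
 */
static int inject_linear(uint32_t aff1, uint16_t target_list, uint64_t *hit)
{
	int visited = 0;

	*hit = 0;
	for (int i = 0; i < NR_VCPUS; i++) {
		uint32_t mpidr = toy_mpidr(i);

		visited++;
		if ((mpidr >> 8) != aff1)
			continue;
		if (target_list & (1u << (mpidr & 0xf)))
			*hit |= 1ULL << i;
	}
	return visited;
}

/*
 * Grouped search: use Aff1 to jump straight to the first vCPU of the
 * target group, then walk only the 16 Aff0 slots of that group.
 */
static int inject_grouped(uint32_t aff1, uint16_t target_list, uint64_t *hit)
{
	int base = (int)(aff1 * AFF0_PER_GRP);
	int visited = 0;

	*hit = 0;
	assert(base + AFF0_PER_GRP <= NR_VCPUS);
	for (int aff0 = 0; aff0 < AFF0_PER_GRP; aff0++) {
		visited++;
		if (target_list & (1u << aff0))
			*hit |= 1ULL << (base + aff0);
	}
	return visited;
}
```

For an SGI aimed at core 63 (Aff1 = 3, target list bit 15), both functions select the same vCPU, but the linear version visits all 64 vCPUs while the grouped version visits 16. For a target in the first group (cores 0-15), the grouped version still visits 16 slots, which is the case where performance must not regress.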

M.

Xu.