Re: [RFC] KVM: arm/arm64: optimize vSGI injection performance

From: Marc Zyngier
Date: Mon Aug 21 2023 - 06:16:39 EST


On Mon, 21 Aug 2023 09:59:17 +0100,
Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> [adding the KVM/arm64 maintainers & list]

Thanks for that.

>
> Mark.
>
> On Fri, Aug 18, 2023 at 06:47:04PM +0800, Xu Zhao wrote:
> > In the worst case, injecting a single SGI may require iterating over
> > every vCPU in the VM. However, the ICC_SGI_* registers carry affinity
> > routing information, and we would like to use it to reduce the number
> > of iterations from the number of vCPUs down to 16 (the length of the
> > TargetList), or even to 8.
> >
> > This work is based on v5.4, and here is test data:

This is a 4-year-old kernel. I'm afraid you'll have to provide
something relevant to a current (i.e. v6.5) kernel.
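
For anyone joining the thread: the lookup being optimised is the
brute-force walk in vgic_v3_dispatch_sgi(). Every trapped ICC_SGI1R_EL1
write iterates over all vCPUs and compares each MPIDR against the
affinity encoded in the register. Roughly this shape (an abridged
sketch from memory of the v5.4 code in virt/kvm/arm/vgic/vgic-mmio-v3.c,
not the literal source):

	kvm_for_each_vcpu(c, c_vcpu, kvm) {
		if (!broadcast && target_cpus == 0)
			break;			/* all requested CPUs handled */

		if (broadcast && c == vcpu_id)
			continue;		/* don't signal the sender */

		if (!broadcast) {
			/* match Aff3.Aff2.Aff1 and the Aff0 TargetList bit */
			int level0 = match_mpidr(mpidr, target_cpus, c_vcpu);

			if (level0 == -1)
				continue;

			target_cpus &= ~BIT(level0);
		}

		/* ... look up the SGI on c_vcpu and make it pending ... */
	}

What the cover letter proposes is to replace this O(nr_vcpus) walk with
a lookup keyed on the Aff3.Aff2.Aff1 fields, so that at most the 16
TargetList candidates are considered.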

> > 4 cores with vcpu pinning:
> > | ipi benchmark | vgic_v3_dispatch_sgi |
> > | original | with patch | improved | original | with patch | improved |
> > | core0 -> core1 | 292610285 ns | 299856696 ns | -2.5% | 1471 ns | 1508 ns | -2.5% |
> > | core0 -> core3 | 333815742 ns | 327647989 ns | +1.8% | 1578 ns | 1532 ns | +2.9% |
> > | core0 -> all | 439754104 ns | 433987192 ns | +1.3% | 2970 ns | 2875 ns | +3.2% |
> >
> > 32 cores with vcpu pinning:
> > | ipi benchmark | vgic_v3_dispatch_sgi |
> > | original | with patch | improved | original | with patch | improved |
> > | core0 -> core1 | 269153219 ns | 261636906 ns | +2.8% | 1743 ns | 1706 ns | +2.1% |
> > | core0 -> core31 | 685199666 ns | 355250664 ns | +48.2% | 4238 ns | 1838 ns | +56.6% |
> > | core0 -> all | 7281278980 ns | 3403240773 ns | +53.3% | 30879 ns | 13843 ns | +55.2% |
> >
> > Based on the test results, the performance of VMs with fewer than 16
> > cores remains almost the same, while a significant improvement can be
> > observed with more than 16 cores.

This triggers multiple questions:

- what is the test being used? on what hardware? how can I reproduce
this data?

- which guest OS *currently* makes use of broadcast or 1:N SGIs?
Linux doesn't, and overall SGI multicasting is pretty useless to
an OS.

[...]

> > /*
> > - * Compare a given affinity (level 1-3 and a level 0 mask, from the SGI
> > - * generation register ICC_SGI1R_EL1) with a given VCPU.
> > - * If the VCPU's MPIDR matches, return the level0 affinity, otherwise
> > - * return -1.
> > + * Get affinity routing index from ICC_SGI_* register
> > + * format:
> > + * aff3 aff2 aff1 aff0
> > + * |- 8 bits -|- 8 bits -|- 8 bits -|- 4 bits or 8 bits -|
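
For reference, the upper affinity levels are already pulled out of the
trapped register value today; the v5.4 code does something like this
(again from memory; SGI_AFFINITY_LEVEL() lives in vgic-mmio-v3.c and
the ICC_SGI1R_* field definitions in sysreg.h):

	/* Aff3/Aff2/Aff1 identify the cluster, 8 bits each */
	mpidr  = SGI_AFFINITY_LEVEL(reg, 3);
	mpidr |= SGI_AFFINITY_LEVEL(reg, 2);
	mpidr |= SGI_AFFINITY_LEVEL(reg, 1);

	/* Aff0 is not a plain value but a 16-bit TargetList, one bit per PE */
	target_cpus = (reg & ICC_SGI1R_TARGET_LIST_MASK) >>
		      ICC_SGI1R_TARGET_LIST_SHIFT;

The "4 bits or 8 bits" at Aff0 in the new comment is where things stop
being that simple, which brings us to the RS field.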

OK, so you are implementing RSS support:

- Why isn't that mentioned anywhere in the commit log?

- Given that KVM actively limits the MPIDR to 4 bits at Aff0, how does
it even work in the first place?

- How is that advertised to the guest?

- How can the guest enable RSS support?
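
For the record, here is how the architecture ties the TargetList to RS
once RSS is implemented and enabled (a sketch only; check_aff0() is a
made-up placeholder, and I'm assuming the ICC_SGI1R_RS_* definitions
from sysreg.h):

	/*
	 * With RSS, each set bit i in the TargetList addresses the PE
	 * whose Aff0 == (RS * 16) + i, within the Aff3.Aff2.Aff1 given
	 * in the same write.
	 */
	rs = (reg & ICC_SGI1R_RS_MASK) >> ICC_SGI1R_RS_SHIFT;
	for_each_set_bit(i, &target_list, 16)
		check_aff0(rs * 16 + i);

Since KVM clamps guest MPIDRs to 4 bits of Aff0, any non-zero RS range
is empty, and without RSS being advertised via ICC_CTLR_EL1.RSS the
guest shouldn't be setting RS in the first place.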

This is not following the GICv3 architecture, and I'm sceptical that
it actually works as is (I strongly suspect that you have additional
patches...).

M.

--
Without deviation from the norm, progress is not possible.