Re: [RFC v2] KVM: x86/vmx: Suppress posted interrupt notification when CPU is in host

From: Sean Christopherson
Date: Mon Sep 26 2022 - 13:11:20 EST


On Fri, Sep 23, 2022, Chao Gao wrote:
> Set PID.SN right after VM exits and clear it before VM entry to minimize
> the chance of hardware issuing PINs to a CPU when it's in host.

Toggling PID.SN as close to the world switch as possible is undesirable. If KVM
re-enters the guest without enabling IRQs, i.e. handles the VM-Exit in the fastpath,
then the notification IRQ will be delivered in the guest.

The natural location to do the toggling is when KVM "toggles" its software
mode, i.e. when KVM sets vcpu->mode to IN_GUEST_MODE (clear SN) and to
OUTSIDE_GUEST_MODE (set SN).

I believe that would also obviate the need to manually send a PI notification
IRQ, as the existing ->sync_pir_to_irr() call, which is there to handle exactly
this case (notification not sent, or handled in the host), would ensure any
outstanding posted IRQ gets moved to the IRR and processed accordingly.
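
Very roughly, and ignoring that the PI descriptor is vendor specific (more on
that below), the idea is (sketch only, untested, and assumes a pi_clear_sn()
counterpart to the existing pi_set_sn()):

	/* Conceptual placement in vcpu_enter_guest(), not actual code. */
	vcpu->mode = IN_GUEST_MODE;
	pi_clear_sn(pi_desc);		/* notifications wanted while in guest */

	/* ... VM-Enter/VM-Exit, including fastpath re-entries ... */

	vcpu->mode = OUTSIDE_GUEST_MODE;
	pi_set_sn(pi_desc);		/* suppress notifications while in host */

i.e. fastpath re-entries never run the guest with notifications suppressed, and
anything posted after SN is set gets mopped up by ->sync_pir_to_irr().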

> Opportunistically clean up vmx_vcpu_pi_put(); when a vCPU is preempted,

Uh uh, this patch is already way, way too subtle and complex to tack on clean up.
"Opportunistic" clean up is for cases where the clean up is a pure refactoring
and/or has zero impact on functionality.

> it is pointless to update PID.NV to wakeup vector since notification is
> anyway suppressed. And since PID.SN should be already set for running
> vCPUs, so, don't set it again for preempted vCPUs.

I'm pretty sure this is wrong. If the vCPU is preempted between prepare_to_rcuwait()
and schedule(), then skipping pi_enable_wakeup_handler() will hang the guest if
the wakeup event is a posted IRQ and the event arrives while the vCPU is preempted.
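
For reference, a heavily simplified view of the blocking path (preemption
handling, error paths, and stats omitted from kvm_vcpu_block()) to show where
the preemption can land:

	kvm_arch_vcpu_blocking(vcpu);
	prepare_to_rcuwait(wait);

	/*
	 * Preemption anywhere between here and schedule() runs
	 * vmx_vcpu_pi_put() with vcpu->preempted set while the vCPU is also
	 * in its blocking state.  If the wakeup handler isn't armed in that
	 * case, a posted IRQ that arrives while the task is scheduled out has
	 * nothing to wake the vCPU.
	 */
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (kvm_vcpu_check_block(vcpu) < 0)
			break;
		schedule();
	}

	finish_rcuwait(wait);
	kvm_arch_vcpu_unblocking(vcpu);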

> When IPI virtualization is enabled, this patch increases "perf bench" [*]
> by 6.56%, and PIN count in 1 second drops from tens of thousands to
> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
> VM-exit round-trip latency.

The overhead is more than likely due to the pi_is_pir_empty() check in the
VM-Entry path, i.e. it should in theory go away if PID.SN is cleared/set at the
IN_GUEST_MODE and OUTSIDE_GUEST_MODE transitions.

> Also honour PID.SN bit in vmx_deliver_posted_interrupt().

Why?

> When IPI virtualization is enabled, this patch increases "perf bench" [*]
> by 6.56%, and PIN count in 1 second drops from tens of thousands to
> hundreds. But cpuid loop test shows this patch causes 1.58% overhead in
> VM-exit round-trip latency.
>
> [*] test cmd: perf bench sched pipe -T. Note that we change the source
> code to pin two threads to two different vCPUs so that it can reproduce
> stable results.
>
> Signed-off-by: Chao Gao <chao.gao@xxxxxxxxx>
> ---
> RFC: I am not sure whether the benefits outweighs the extra VM-exit cost.
>
> Changes in v2 (addressed comments from Kevin):
> - measure/estimate the impact to non-IPC-intensive cases
> - don't tie PID.SN to vcpu->mode. Instead, clear PID.SN
> right before VM-entry and set it after VM-exit.

Ah, sorry, missed v1. Rather than key off of IN_GUEST_MODE in the sync path, add
an explicit kvm_x86_ops hook to perform the transition. I.e. make it explicit.
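
Something like the below is what I have in mind (purely illustrative, the hook
names are invented):

	/* New, hypothetical kvm_x86_ops members. */
	void (*pi_start_guest_mode)(struct kvm_vcpu *vcpu);
	void (*pi_end_guest_mode)(struct kvm_vcpu *vcpu);

	/* VMX implementations, e.g. in posted_intr.c. */
	static void vmx_pi_start_guest_mode(struct kvm_vcpu *vcpu)
	{
		pi_clear_sn(vcpu_to_pi_desc(vcpu));
	}

	static void vmx_pi_end_guest_mode(struct kvm_vcpu *vcpu)
	{
		pi_set_sn(vcpu_to_pi_desc(vcpu));
	}

with common code invoking the hooks at the vcpu->mode transitions sketched
above, instead of having the sync path infer the state from IN_GUEST_MODE.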

> - use !pi_is_pir_empty() in pi_enable_wakeup_handler() to
> check if any interrupt was posted after clearing SN.
> - clean up vmx_vcpu_pi_put(). See comments above.
>
> arch/x86/kvm/lapic.c | 2 ++
> arch/x86/kvm/vmx/posted_intr.c | 55 +++++++++++-----------------------
> arch/x86/kvm/vmx/vmx.c | 34 ++++++++++++++++++++-
> 3 files changed, 53 insertions(+), 38 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 9dda989a1cf0..a9f27c6ce937 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -128,7 +128,9 @@ static inline int __apic_test_and_clear_vector(int vec, void *bitmap)
> }
>
> __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_hw_disabled, HZ);
> +EXPORT_SYMBOL_GPL(apic_hw_disabled);
> __read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_sw_disabled, HZ);
> +EXPORT_SYMBOL_GPL(apic_sw_disabled);

These exports will hopefully go away.

> static inline int apic_enabled(struct kvm_lapic *apic)
> {
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 1b56c5e5c9fb..9cec3dd88f75 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -70,12 +70,6 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> * needs to be changed.
> */
> if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu == cpu) {
> - /*
> - * Clear SN if it was set due to being preempted. Again, do
> - * this even if there is no assigned device for simplicity.
> - */
> - if (pi_test_and_clear_sn(pi_desc))
> - goto after_clear_sn;
> return;
> }
>
> @@ -101,11 +95,16 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> new.control = old.control;
>
> /*
> - * Clear SN (as above) and refresh the destination APIC ID to
> - * handle task migration (@cpu != vcpu->cpu).
> + * Set SN and refresh the destination APIC ID to handle
> + * task migration (@cpu != vcpu->cpu).
> + *
> + * SN is cleared when a vCPU goes to blocked state so that
> + * the blocked vCPU can be waken up on receiving a
> + * notification. For a running/runnable vCPU, such
> + * notifications are useless. Set SN bit to suppress them.
> */
> new.ndst = dest;
> - new.sn = 0;
> + new.sn = 1;

To handle the preempted case, I believe the correct behavior is to leave SN
as-is, although that would require setting SN=1 during vCPU creation. Arguably
KVM should do that anyways when APICv is enabled.

Hmm, or alternatively this should do the same?

	new.sn = !kvm_vcpu_is_blocking(vcpu);
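
For the vCPU creation option, assuming the PI descriptor is zero-allocated
today (i.e. SN starts at 0), the sketch is just:

	/* In vmx_vcpu_create(); gate on whatever condition makes sense. */
	if (enable_apicv)
		pi_set_sn(&vmx->pi_desc);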

> @@ -172,8 +160,10 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
> * enabled until it is safe to call try_to_wake_up() on the task being
> * scheduled out).
> */
> - if (pi_test_on(&new))
> + if (!pi_is_pir_empty(pi_desc)) {
> + pi_set_on(pi_desc);

As much as I wish we could get rid of kvm_arch_vcpu_blocking(), I actually think
this would be a good application of that hook. If PID.SN is cleared during
kvm_arch_vcpu_blocking() and set during kvm_arch_vcpu_unblocking(), then I believe
there's no need to manually check the PIR here, as any IRQ that isn't detected by
kvm_vcpu_check_block() is guaranteed to set PID.ON=1.
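
Sketch of what I'm thinking, e.g. wiring VMX into the vcpu_blocking/vcpu_unblocking
hooks that AVIC already implements (untested):

	static void vmx_pi_vcpu_blocking(struct kvm_vcpu *vcpu)
	{
		/*
		 * Unsuppress notifications so that any IRQ posted while the
		 * vCPU is blocked is guaranteed to set PID.ON, which the
		 * wakeup/PIR checks will observe.
		 */
		pi_clear_sn(vcpu_to_pi_desc(vcpu));
	}

	static void vmx_pi_vcpu_unblocking(struct kvm_vcpu *vcpu)
	{
		/* Notifications are useless again once the vCPU is runnable. */
		pi_set_sn(vcpu_to_pi_desc(vcpu));
	}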

> apic->send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
> + }
>
> local_irq_restore(flags);
> }
> @@ -193,21 +183,12 @@ static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
>
> void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
> {
> - struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> -
> if (!vmx_needs_pi_wakeup(vcpu))
> return;
>
> - if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
> + if (!vcpu->preempted && kvm_vcpu_is_blocking(vcpu) &&

As above, I don't think skipping this for preempted is correct.

> + !vmx_interrupt_blocked(vcpu))
> pi_enable_wakeup_handler(vcpu);
> -
> - /*
> - * Set SN when the vCPU is preempted. Note, the vCPU can both be seen
> - * as blocking and preempted, e.g. if it's preempted between setting
> - * its wait state and manually scheduling out.
> - */

Assuming I'm correct, hoist the "note" above to document that KVM needs to enable
the wakeup handler even when the vCPU is preempted.

> - if (vcpu->preempted)
> - pi_set_sn(pi_desc);
> }
>
> /*
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c9b49a09e6b5..949fb80eca3d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4186,6 +4186,9 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
> if (pi_test_and_set_pir(vector, &vmx->pi_desc))
> return 0;
>
> + if (pi_test_sn(&vmx->pi_desc))
> + return 0;

Should be unnecessary.