[PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemption timer to assist buggy guests

From: Vitaly Kuznetsov
Date: Thu Mar 28 2019 - 16:31:19 EST


This is embarrassing, but we have another Windows/Hyper-V issue to work
around in KVM (or QEMU). Hopefully the "RFC" tag makes it less offensive.

It was noticed that a Hyper-V guest on a q35 KVM/QEMU VM hangs on boot if
e.g. a 'piix4-usb-uhci' device is attached. The problem with this device is
that it uses level-triggered interrupts.

The 'hang' scenario develops like this:
1) Hyper-V boots and QEMU tries to inject two IRQs simultaneously. One of
them is level-triggered. KVM injects the edge-triggered one and requests an
immediate exit to inject the level-triggered one:

kvm_set_irq: gsi 23 level 1 source 0
kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0

2) Hyper-V requires one of its VMs to run to handle the situation, but the
immediate exit happens first:

kvm_entry: vcpu 0
kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
kvm_entry: vcpu 0
kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0

3) Hyper-V doesn't want to deal with the second IRQ (as its VM still hasn't
processed the first one), so it just does an 'EOI' for the level-triggered
interrupt, and this causes re-injection:

kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
kvm_eoi: apicid 0 vector 80
kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
kvm_set_irq: gsi 23 level 1 source 0
kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
kvm_entry: vcpu 0

4) As we arm the preemption timer again, we get stuck in an infinite loop:

kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
kvm_entry: vcpu 0
kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
kvm_entry: vcpu 0
kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
kvm_eoi: apicid 0 vector 80
kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
kvm_set_irq: gsi 23 level 1 source 0
kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
kvm_entry: vcpu 0
kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
...
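
As an aid for reading the int_info values in the traces above, here is a
minimal sketch that decodes the VMX interruption-information field. The
bit layout (vector in bits 7:0, type in bits 10:8, valid in bit 31) is the
one the Intel SDM defines; the struct and function names are hypothetical
and are not taken from KVM.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper type, not a KVM structure.
 * Field layout per the Intel SDM: bits 7:0 vector, bits 10:8 type,
 * bit 31 valid.
 */
struct intr_info {
	bool valid;
	unsigned int type;	/* 0 means external interrupt */
	unsigned int vector;
};

/* Split a raw 32-bit interruption-information value into its fields. */
static struct intr_info decode_intr_info(uint32_t raw)
{
	struct intr_info ii = {
		.valid  = (raw >> 31) & 1,
		.type   = (raw >> 8) & 0x7,
		.vector = raw & 0xff,
	};

	return ii;
}
```

With this, 80000060 from step 1 decodes to a valid external interrupt with
vector 0x60 (96, the edge-triggered one), and 80000050 from steps 2 and 4
decodes to vector 0x50 (80, the level-triggered one).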

How does this work on hardware? I don't really know, but it seems that we
were dealing with similar issues before, see commit 184564efae4d7 ("kvm:
ioapic: conditionally delay irq delivery duringeoi broadcast"). If EOI
doesn't always cause an *immediate* interrupt re-injection, some progress
can be made.

An alternative approach would be to port the above-mentioned commit to
QEMU's ioapic implementation. I'm, however, not sure that level-triggered
interrupts are a must to trigger the issue.

HV_TIMER_THROTTLE/HV_TIMER_DELAY are more or less arbitrary. I haven't
looked at SVM yet, but its immediate exit implementation does
smp_send_reschedule(); I'm not sure whether this can cause the lockup
described above.
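
To make the effect of the two constants concrete, below is a minimal
user-space sketch of the counting logic from the patch, detached from KVM.
struct throttle_state and both function names are hypothetical stand-ins
(in the patch the state lives in struct loaded_vmcs and the result is
passed to vmx_arm_hv_timer()); only the comparison logic and the constant
values are copied from the diff.

```c
#include <stdint.h>

#define HV_TIMER_THROTTLE 100	/* same value as in the patch */
#define HV_TIMER_DELAY    1000	/* same value as in the patch */

/* Hypothetical stand-in for the two fields added to struct loaded_vmcs. */
struct throttle_state {
	unsigned long lastrip;
	int lastrip_cnt;
};

/*
 * Model of one immediate-exit request at 'rip': returns the value the
 * patch would hand to vmx_arm_hv_timer() (0 means exit immediately).
 */
static uint32_t immediate_exit_timer_val(struct throttle_state *st,
					 unsigned long rip)
{
	if (st->lastrip == rip) {
		++st->lastrip_cnt;
	} else {
		st->lastrip_cnt = 0;
		st->lastrip = rip;
	}

	return st->lastrip_cnt > HV_TIMER_THROTTLE ? HV_TIMER_DELAY : 0;
}

/*
 * Count how many consecutive requests at the same rip it takes, starting
 * from zeroed state, until the timer is first armed with a delay.
 */
static int requests_until_delay(unsigned long rip)
{
	struct throttle_state st = { 0 };
	int n = 0;

	do {
		n++;
	} while (immediate_exit_timer_val(&st, rip) == 0);

	return n;
}
```

In this model, with a non-zero rip the first request only records the rip,
so the 102nd consecutive request at that rip is the first to get
HV_TIMER_DELAY. Every further request at the same rip stays delayed until
a request arrives with a different rip, which resets the counter.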

Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
---
arch/x86/kvm/vmx/vmcs.h | 2 ++
arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++++++-
2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index cb6079f8a227..983dfc60c30c 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -54,6 +54,8 @@ struct loaded_vmcs {
 	bool launched;
 	bool nmi_known_unmasked;
 	bool hv_timer_armed;
+	unsigned long hv_timer_lastrip;
+	int hv_timer_lastrip_cnt;
 	/* Support for vnmi-less CPUs */
 	int soft_vnmi_blocked;
 	ktime_t entry_time;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 617443c1c6d7..8a49ec14dd3a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6321,14 +6321,36 @@ static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
 	vmx->loaded_vmcs->hv_timer_armed = true;
 }
 
+#define HV_TIMER_THROTTLE 100
+#define HV_TIMER_DELAY 1000
+
 static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 tscl;
 	u32 delta_tsc;
 
+	/*
+	 * Some guests are buggy and may behave in a way that causes KVM to
+	 * always request immediate exit (e.g. of a nested guest). At the same
+	 * time guest progress may be required to resolve the condition.
+	 * Throttling below makes sure we never get stuck completely.
+	 */
 	if (vmx->req_immediate_exit) {
-		vmx_arm_hv_timer(vmx, 0);
+		unsigned long rip = kvm_rip_read(vcpu);
+
+		if (vmx->loaded_vmcs->hv_timer_lastrip == rip) {
+			++vmx->loaded_vmcs->hv_timer_lastrip_cnt;
+		} else {
+			vmx->loaded_vmcs->hv_timer_lastrip_cnt = 0;
+			vmx->loaded_vmcs->hv_timer_lastrip = rip;
+		}
+
+		if (vmx->loaded_vmcs->hv_timer_lastrip_cnt > HV_TIMER_THROTTLE)
+			vmx_arm_hv_timer(vmx, HV_TIMER_DELAY);
+		else
+			vmx_arm_hv_timer(vmx, 0);
+
 		return;
 	}

--
2.20.1