Re: VMs freezing when host is running 4.14

From: Liran Alon
Date: Thu Nov 23 2017 - 11:18:22 EST




On 23/11/17 17:59, Radim Krčmář wrote:
2017-11-23 16:20+0100, Marc Haber:
On Wed, Nov 22, 2017 at 05:43:13PM +0100, Radim Krčmář wrote:
2017-11-22 16:52+0100, Marc Haber:
On Wed, Nov 22, 2017 at 04:04:42PM +0100, çéæ wrote:
So all guest kernels are 4.14, or also other older kernel?

Guest kernels are also 4.14, but the issue disappears when the host is
downgraded to an older kernel. I therefore reckoned that the guest
kernel doesn't matter, but that was before I saw the trace in the log.

The two most suspicious patches since 4.13 (which I assume works) are

664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been
injected")

That one does not revert cleanly; the line in question seems to have
been removed a bit later.

Reject is:
141 [24/5001]mh@fan:~/linux/git/linux ((v4.14.1) %) $ cat arch/x86/kvm/vmx.c.rej
--- arch/x86/kvm/vmx.c
+++ arch/x86/kvm/vmx.c
@@ -2516,7 +2516,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned nr = vcpu->arch.exception.nr;
bool has_error_code = vcpu->arch.exception.has_error_code;
- bool reinject = vcpu->arch.exception.injected;
+ bool reinject = vcpu->arch.exception.reinject;
u32 error_code = vcpu->arch.exception.error_code;
u32 intr_info = nr | INTR_INFO_VALID_MASK;

This one line can be deleted as reinject isn't used in the function.

Btw. there have been already many fixes from Liran Alon for that patch
and your case could be the one addressed in
https://www.spinics.net/lists/kvm/msg159158.html

The patch is incorrect, but you might be able to see only its benefits.

Actually I would first attempt to check this patch of mine:
https://www.spinics.net/lists/kvm/msg159062.html
It fixes a bug of a L2 exception accidentally being delivered into L1.

Regards,
-Liran

and

9a6e7c39810e ("KVM: async_pf: Fix #DF due to inject "Page not Present"
and "Page Ready" exceptions simultaneously")

please try reverting them to see if it helps,

That one reverted cleanly. I am now running the new kernel on the
affected machine, and I think that a second machine has joined the
market of being affected.

That one had much lower chances of being the culprit.

Would this matter on the host only or on the guests as well?

Only on the host.

Thanks.
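
For readers following along, the revert procedure discussed in this thread could be sketched roughly as below. This is not taken from any of the posters: the function name revert_suspects and the default tree path are hypothetical, and the use of "git apply -R --reject" is an assumption about how the .rej file above might have been produced. Only the two commit IDs and the vmx.c.rej path come from the thread.

```shell
# Sketch only: back out the two suspect commits on a 4.14 host kernel tree.
# revert_suspects and its argument are illustrative names, not from the thread.
revert_suspects() (
    tree="${1:-.}"              # path to the kernel checkout (assumed)
    cd "$tree" || exit 1

    # Reported in the thread to revert cleanly.
    git revert --no-edit 9a6e7c39810e || exit 1

    # Reported not to revert cleanly: reverse-apply the commit and let
    # git leave any failing hunks in *.rej files instead of aborting.
    git show 664f8e26b00c | git apply -R --reject || true

    # Per the thread, the rejected vmx.c hunk only touches the unused
    # 'reinject' local, so the reject can be discarded.
    rm -f arch/x86/kvm/vmx.c.rej
)
```

It would be invoked as "revert_suspects ~/linux/git/linux" on the host tree only, since the thread states the change matters only on the host. The body runs in a subshell so the caller's working directory is left alone.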