Re: [PATCH] Move VMEnter and VMExit tracepoints closer to the actual event

From: Sean Christopherson
Date: Thu May 20 2021 - 11:32:42 EST


On Wed, May 19, 2021, Stefano De Venuto wrote:
> The kvm_entry and kvm_exit tracepoints are still quite far from the
> actual VMEnters/VMExits. This means that in a trace we can find host
> events after a kvm_entry event and before a kvm_exit one, as in this
> example:
>
> trace.dat: CPU 0/KVM-4594 [001] 2.167191: kvm_entry:
> trace.dat: CPU 0/KVM-4594 [001] 2.167192: write_msr: 48, value 0
> trace.dat: CPU 0/KVM-4594 [001] 2.167192: rcu_utilization: Start context switch
> trace.dat: CPU 0/KVM-4594 [001] 2.167192: rcu_utilization: End context switch
> trace-tumbleweed.dat: <idle>-0 [000] 2.167196: hrtimer_cancel:
> trace-tumbleweed.dat: <idle>-0 [000] 2.167197: hrtimer_expire_entry:
> trace-tumbleweed.dat: <idle>-0 [000] 2.167201: hrtimer_expire_exit:
> trace-tumbleweed.dat: <idle>-0 [000] 2.167201: hrtimer_start:
> trace.dat: CPU 0/KVM-4594 [001] 2.167203: read_msr: 48, value 0
> trace.dat: CPU 0/KVM-4594 [001] 2.167203: write_msr: 48, value 4
> trace.dat: CPU 0/KVM-4594 [001] 2.167204: kvm_exit:
>
> This patch moves the tracepoints closer to the events, for both Intel
> and AMD, so that a combined host-guest trace will offer a more
> realistic representation of what is really happening, as shown here:
>
> trace.dat: CPU 0/KVM-2553 [000] 2.190290: write_msr: 48, value 0

I'm not sure this is a good thing, as it's not clear to me that invoking tracing
with the guest's SPEC_CTRL loaded is desirable. Maybe it's a non-issue, but it
should be explicitly called out and discussed.

And to some extent, the current behavior is _more_ accurate because it shows that
KVM started its VM-Enter sequence and then the WRMSR occurred as part of that
sequence. It is writing the guest's value after all. Ditto for XCR0, XSS, PKRU,
Intel PT, etc...

A more concrete example would be perf; on VMX, if a perf NMI happens after KVM
invokes atomic_switch_perf_msrs() then I absolutely want to see that reflected
in the trace, e.g. to help debug the PEBS mess[*]. If the VM-Enter tracepoint
is moved closer to VM-Enter, that may or may not hold true depending on where the
NMI lands.

On VMX, I think the tracepoint can be moved below the VMWRITEs without much
contention (though doing so is likely a nop), but moving it below
kvm_load_guest_xsave_state() requires a bit more discussion.

I 100% agree that the current behavior can be a bit confusing, but I wonder if
we'd be better off "solving" that problem through documentation.

[*] https://lkml.kernel.org/r/20210209225653.1393771-1-jmattson@xxxxxxxxxx

> trace.dat: CPU 0/KVM-2553 [000] 2.190290: rcu_utilization: Start context switch
> trace.dat: CPU 0/KVM-2553 [000] 2.190290: rcu_utilization: End context switch
> trace.dat: CPU 0/KVM-2553 [000] 2.190290: kvm_entry:
> trace-tumbleweed.dat: <idle>-0 [000] 2.190290: write_msr:
> trace-tumbleweed.dat: <idle>-0 [000] 2.190290: cpu_idle:
> trace.dat: CPU 0/KVM-2553 [000] 2.190291: kvm_exit:
> trace.dat: CPU 0/KVM-2553 [000] 2.190291: read_msr: 48, value 0
> trace.dat: CPU 0/KVM-2553 [000] 2.190291: write_msr: 48, value 4
>
> Signed-off-by: Stefano De Venuto <stefano.devenuto99@xxxxxxxxx>
> Signed-off-by: Dario Faggioli <dfaggioli@xxxxxxxx>
> ---

...

> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 4bceb5ca3a89..33c732101b83 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6661,6 +6661,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> {
> kvm_guest_enter_irqoff();
>
> + trace_kvm_entry(vcpu);
> +
> /* L1D Flush includes CPU buffer clear to mitigate MDS */
> if (static_branch_unlikely(&vmx_l1d_should_flush))
> vmx_l1d_flush(vcpu);
> @@ -6675,6 +6677,9 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>
> vcpu->arch.cr2 = native_read_cr2();
>
> + vmx->exit_reason.full = vmcs_read32(VM_EXIT_REASON);
> + trace_kvm_exit(vmx->exit_reason.full, vcpu, KVM_ISA_VMX);

This is wrong in the 'vmx->fail == true' case.

> +
> kvm_guest_exit_irqoff();
> }
>
> @@ -6693,8 +6698,6 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> if (vmx->emulation_required)
> return EXIT_FASTPATH_NONE;
>
> - trace_kvm_entry(vcpu);
> -
> if (vmx->ple_window_dirty) {
> vmx->ple_window_dirty = false;
> vmcs_write32(PLE_WINDOW, vmx->ple_window);
> @@ -6814,15 +6817,12 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> return EXIT_FASTPATH_NONE;
> }
>
> - vmx->exit_reason.full = vmcs_read32(VM_EXIT_REASON);
> if (unlikely((u16)vmx->exit_reason.basic == EXIT_REASON_MCE_DURING_VMENTRY))
> kvm_machine_check();
>
> if (likely(!vmx->exit_reason.failed_vmentry))
> vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
>
> - trace_kvm_exit(vmx->exit_reason.full, vcpu, KVM_ISA_VMX);
> -
> if (unlikely(vmx->exit_reason.failed_vmentry))
> return EXIT_FASTPATH_NONE;
>
> --
> 2.31.1
>