Re: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

From: Sean Christopherson
Date: Wed Jan 24 2024 - 10:50:05 EST


On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid.
> Without this, there is an issue in live migration. In particular, to
> migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a
> non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on
> the target is unable to set the value. This creates confusions on the user
> side.
>
> Fundamentally, it is because vcpu->arch.perf_capabilities as the cached
> value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing
> wrong on the kvm_get_msr_common() which just reads
> vcpu->arch.perf_capabilities.
>
> Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
> early in VM setup time.
>
> Cc: Aaron Lewis <aaronlewis@xxxxxxxxxx>
> Signed-off-by: Mingwei Zhang <mizhang@xxxxxxxxxx>
> ---
> arch/x86/kvm/cpuid.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index adba49afb5fe..416bee03c42a 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -369,6 +369,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
> vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
>
> + /* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
> + if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> + vcpu->arch.perf_capabilities = 0;

No, this is just papering over the underlying bug. KVM shouldn't be stuffing
vcpu->arch.perf_capabilities without explicit writes from host userspace. E.g
KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
host userspace write to MSR_IA32_PERF_CAPABILITIES. It's unlikely any userspace
actually does something like that, but KVM overwriting guest state is almost
never a good thing.

I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
KVM needs to simply not stuff vcpu->arch.perf_capabilities. I believe we are
already fudging around this in our internal kernels, so I don't think there's a
need to carry a hack-a-fix for the destination kernel.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 27e23714e960..fdef9d706d61 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)

kvm_async_pf_hash_reset(vcpu);

- vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
kvm_pmu_init(vcpu);

vcpu->arch.pending_external_vector = -1;

> kvm_pmu_refresh(vcpu);
> vcpu->arch.cr4_guest_rsvd_bits =
> __cr4_reserved_bits(guest_cpuid_has, vcpu);
> --
> 2.43.0.429.g432eaa2c6b-goog
>