Re: [PATCH 3/9] KVM: x86: Initialize guest cpu_caps based on guest CPUID

From: Maxim Levitsky
Date: Sun Nov 19 2023 - 12:33:03 EST


On Fri, 2023-11-10 at 15:55 -0800, Sean Christopherson wrote:
> Initialize a vCPU's capabilities based on the guest CPUID provided by
> userspace instead of simply zeroing the entire array. This will allow
> using cpu_caps to query *all* CPUID-based guest capabilities, i.e. will
> allow converting all usage of guest_cpuid_has() to guest_cpu_cap_has().
>
> Zeroing the array was the logical choice when using cpu_caps was opt-in,
> e.g. "unsupported" was generally a safer default, and the whole point of
> governed features is that KVM would need to check host and guest support,
> i.e. making everything unsupported by default didn't require more code.
>
> But requiring KVM to manually "enable" every CPUID-based feature in
> cpu_caps would require an absurd amount of boilerplate code.
>
> Follow existing CPUID/kvm_cpu_caps nomenclature where possible, e.g. for
> the change() and clear() APIs. Replace check_and_set() with restrict() to
> try and capture that KVM is restricting userspace's desired guest feature
> set based on KVM's capabilities.
>
> This is intended to be gigantic nop, i.e. should not have any impact on
> guest or KVM functionality.
>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
> arch/x86/kvm/cpuid.c | 43 +++++++++++++++++++++++++++++++++++++++---
> arch/x86/kvm/cpuid.h | 25 +++++++++++++++++++++---
> arch/x86/kvm/svm/svm.c | 24 +++++++++++------------
> arch/x86/kvm/vmx/vmx.c | 6 ++++--
> 4 files changed, 78 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 4bf3c2d4dc7c..5cf3d697ecb3 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -321,13 +321,51 @@ static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
> return entry && entry->eax == HYPERV_CPUID_SIGNATURE_EAX;
> }
>
> +/*
> + * This isn't truly "unsafe", but all callers except kvm_cpu_after_set_cpuid()
> + * should use __cpuid_entry_get_reg(), which provides compile-time validation
> + * of the input.
> + */
> +static u32 cpuid_get_reg_unsafe(struct kvm_cpuid_entry2 *entry, u32 reg)
> +{
> + switch (reg) {
> + case CPUID_EAX:
> + return entry->eax;
> + case CPUID_EBX:
> + return entry->ebx;
> + case CPUID_ECX:
> + return entry->ecx;
> + case CPUID_EDX:
> + return entry->edx;
> + default:
> + WARN_ON_ONCE(1);
> + return 0;
> + }
> +}



> +
> static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> {
> struct kvm_lapic *apic = vcpu->arch.apic;
> struct kvm_cpuid_entry2 *best;
> bool allow_gbpages;
> + int i;
>
> - memset(vcpu->arch.cpu_caps, 0, sizeof(vcpu->arch.cpu_caps));
> + BUILD_BUG_ON(ARRAY_SIZE(reverse_cpuid) != NR_KVM_CPU_CAPS);
> +
> + /*
> + * Reset guest capabilities to userspace's guest CPUID definition, i.e.
> + * honor userspace's definition for features that don't require KVM or
> + * hardware management/support (or that KVM simply doesn't care about).
> + */
> + for (i = 0; i < NR_KVM_CPU_CAPS; i++) {
> + const struct cpuid_reg cpuid = reverse_cpuid[i];
> +
> + best = kvm_find_cpuid_entry_index(vcpu, cpuid.function, cpuid.index);
> + if (best)
> + vcpu->arch.cpu_caps[i] = cpuid_get_reg_unsafe(best, cpuid.reg);

Why not just use __cpuid_entry_get_reg?

cpuid.reg comes from read/only 'reverse_cpuid' anyway, and in fact
it seems that all callers of __cpuid_entry_get_reg, take the reg value from
x86_feature_cpuid() which also takes it from 'reverse_cpuid'.

So if the compiler is smart enough to not complain in these cases, I don't see why this case
is different.


Also why not to initialize guest_caps = host_caps & userspace_cpuid?

If this was the default we won't need any guest_cpu_cap_restrict and such,
instead it will just work.

Special code will only be needed in few more complex cases, like forced exposed
of a feature to a guest due to a virtualization hole.


> + else
> + vcpu->arch.cpu_caps[i] = 0;
> + }
>
> /*
> * If TDP is enabled, let the guest use GBPAGES if they're supported in
> @@ -342,8 +380,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> */
> allow_gbpages = tdp_enabled ? boot_cpu_has(X86_FEATURE_GBPAGES) :
> guest_cpuid_has(vcpu, X86_FEATURE_GBPAGES);
> - if (allow_gbpages)
> - guest_cpu_cap_set(vcpu, X86_FEATURE_GBPAGES);
> + guest_cpu_cap_change(vcpu, X86_FEATURE_GBPAGES, allow_gbpages);

IMHO the original code was more readable, now I need to look up the 'guest_cpu_cap_change()'
to understand what is going on.

>
> best = kvm_find_cpuid_entry(vcpu, 1);
> if (best && apic) {
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index 9f18c4395b71..1707ef10b269 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -263,11 +263,30 @@ static __always_inline void guest_cpu_cap_set(struct kvm_vcpu *vcpu,
> vcpu->arch.cpu_caps[x86_leaf] |= __feature_bit(x86_feature);
> }
>
> -static __always_inline void guest_cpu_cap_check_and_set(struct kvm_vcpu *vcpu,
> - unsigned int x86_feature)
> +static __always_inline void guest_cpu_cap_clear(struct kvm_vcpu *vcpu,
> + unsigned int x86_feature)
> {
> - if (kvm_cpu_cap_has(x86_feature) && guest_cpuid_has(vcpu, x86_feature))
> + unsigned int x86_leaf = __feature_leaf(x86_feature);
> +
> + reverse_cpuid_check(x86_leaf);
> + vcpu->arch.cpu_caps[x86_leaf] &= ~__feature_bit(x86_feature);
> +}
> +
> +static __always_inline void guest_cpu_cap_change(struct kvm_vcpu *vcpu,
> + unsigned int x86_feature,
> + bool guest_has_cap)
> +{
> + if (guest_has_cap)
> guest_cpu_cap_set(vcpu, x86_feature);
> + else
> + guest_cpu_cap_clear(vcpu, x86_feature);
> +}

Let's not have this function, it's just not worth it IMHO.

> +
> +static __always_inline void guest_cpu_cap_restrict(struct kvm_vcpu *vcpu,
> + unsigned int x86_feature)
> +{
> + if (!kvm_cpu_cap_has(x86_feature))
> + guest_cpu_cap_clear(vcpu, x86_feature);
> }

The purpose of this function is also very hard to decipher.

If we initialize guest_caps = host_caps & guest_cpuid then we won't
need this function.

>
> static __always_inline bool guest_cpu_cap_has(struct kvm_vcpu *vcpu,
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 8a99a73b6ee5..5827328e30f1 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4315,14 +4315,14 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> * XSS on VM-Enter/VM-Exit. Failure to do so would effectively give
> * the guest read/write access to the host's XSS.
> */
> - if (boot_cpu_has(X86_FEATURE_XSAVE) &&
> - boot_cpu_has(X86_FEATURE_XSAVES) &&
> - guest_cpuid_has(vcpu, X86_FEATURE_XSAVE))
> - guest_cpu_cap_set(vcpu, X86_FEATURE_XSAVES);
> + guest_cpu_cap_change(vcpu, X86_FEATURE_XSAVES,
> + boot_cpu_has(X86_FEATURE_XSAVE) &&
> + boot_cpu_has(X86_FEATURE_XSAVES) &&
> + guest_cpuid_has(vcpu, X86_FEATURE_XSAVE));

In theory this change does change behavior, now the X86_FEATURE_XSAVE will
be set iff the condition is true, but before it was set *if* the condition was true.

>
> - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_NRIPS);
> - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_TSCRATEMSR);
> - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_LBRV);
> + guest_cpu_cap_restrict(vcpu, X86_FEATURE_NRIPS);
> + guest_cpu_cap_restrict(vcpu, X86_FEATURE_TSCRATEMSR);
> + guest_cpu_cap_restrict(vcpu, X86_FEATURE_LBRV);

One of the main reasons I don't like governed features is this manual list.
I want to reach the point that one won't need to add anything manually, unless there
is a good reason to do so, and there are only a few exceptions when the guest cap is set,
while the host's isn't.

>
> /*
> * Intercept VMLOAD if the vCPU mode is Intel in order to emulate that
> @@ -4330,12 +4330,12 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> * SVM on Intel is bonkers and extremely unlikely to work).
> */
> if (!guest_cpuid_is_intel(vcpu))
> - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_V_VMSAVE_VMLOAD);
> + guest_cpu_cap_restrict(vcpu, X86_FEATURE_V_VMSAVE_VMLOAD);
>
> - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_PAUSEFILTER);
> - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_PFTHRESHOLD);
> - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_VGIF);
> - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_VNMI);
> + guest_cpu_cap_restrict(vcpu, X86_FEATURE_PAUSEFILTER);
> + guest_cpu_cap_restrict(vcpu, X86_FEATURE_PFTHRESHOLD);
> + guest_cpu_cap_restrict(vcpu, X86_FEATURE_VGIF);
> + guest_cpu_cap_restrict(vcpu, X86_FEATURE_VNMI);
>
> svm_recalc_instruction_intercepts(vcpu, svm);
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 6328f0d47c64..5a056ad1ae55 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7757,9 +7757,11 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> */
> if (boot_cpu_has(X86_FEATURE_XSAVE) &&
> guest_cpuid_has(vcpu, X86_FEATURE_XSAVE))
> - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_XSAVES);
> + guest_cpu_cap_restrict(vcpu, X86_FEATURE_XSAVES);
> + else
> + guest_cpu_cap_clear(vcpu, X86_FEATURE_XSAVES);
>
> - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_VMX);
> + guest_cpu_cap_restrict(vcpu, X86_FEATURE_VMX);
>
> vmx_setup_uret_msrs(vmx);
>


Best regards,
Maxim Levitsky