Re: [PATCH 3/9] KVM: x86: Initialize guest cpu_caps based on guest CPUID

From: Sean Christopherson
Date: Thu Nov 30 2023 - 20:51:35 EST


On Sun, Nov 19, 2023, Maxim Levitsky wrote:
> On Fri, 2023-11-10 at 15:55 -0800, Sean Christopherson wrote:
> > +/*
> > + * This isn't truly "unsafe", but all callers except kvm_cpu_after_set_cpuid()
> > + * should use __cpuid_entry_get_reg(), which provides compile-time validation
> > + * of the input.
> > + */
> > +static u32 cpuid_get_reg_unsafe(struct kvm_cpuid_entry2 *entry, u32 reg)
> > +{
> > + switch (reg) {
> > + case CPUID_EAX:
> > + return entry->eax;
> > + case CPUID_EBX:
> > + return entry->ebx;
> > + case CPUID_ECX:
> > + return entry->ecx;
> > + case CPUID_EDX:
> > + return entry->edx;
> > + default:
> > + WARN_ON_ONCE(1);
> > + return 0;
> > + }
> > +}

...

> > static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > {
> > struct kvm_lapic *apic = vcpu->arch.apic;
> > struct kvm_cpuid_entry2 *best;
> > bool allow_gbpages;
> > + int i;
> >
> > - memset(vcpu->arch.cpu_caps, 0, sizeof(vcpu->arch.cpu_caps));
> > + BUILD_BUG_ON(ARRAY_SIZE(reverse_cpuid) != NR_KVM_CPU_CAPS);
> > +
> > + /*
> > + * Reset guest capabilities to userspace's guest CPUID definition, i.e.
> > + * honor userspace's definition for features that don't require KVM or
> > + * hardware management/support (or that KVM simply doesn't care about).
> > + */
> > + for (i = 0; i < NR_KVM_CPU_CAPS; i++) {
> > + const struct cpuid_reg cpuid = reverse_cpuid[i];
> > +
> > + best = kvm_find_cpuid_entry_index(vcpu, cpuid.function, cpuid.index);
> > + if (best)
> > + vcpu->arch.cpu_caps[i] = cpuid_get_reg_unsafe(best, cpuid.reg);
>
> Why not just use __cpuid_entry_get_reg?
>
> cpuid.reg comes from read/only 'reverse_cpuid' anyway, and in fact
> it seems that all callers of __cpuid_entry_get_reg, take the reg value from
> x86_feature_cpuid() which also takes it from 'reverse_cpuid'.
>
> So if the compiler is smart enough to not complain in these cases, I don't
> see why this case is different.

It's because the input isn't a compile-time constant, and so the BUILD_BUG() in
the default path will fire. All of the compile-time assertions in reverse_cpuid.h
rely on the feature being a constant value, which allows the compiler to optimize
away the dead paths, i.e. turn __cpuid_entry_get_reg()'s switch statement into
simple pointer arithmetic and thus omit the BUILD_BUG() code.

> Also why not to initialize guest_caps = host_caps & userspace_cpuid?
>
> If this was the default we won't need any guest_cpu_cap_restrict and such,
> instead it will just work.

Hrm, I definitely like the idea. Unfortunately, unless we do an audit of all
~120 uses of guest_cpuid_has(), restricting those based on kvm_cpu_caps might
break userspace.

Aside from purging the governed feature nomenclature, the main goal of this series
provide a way to do fast lookups of all known guest CPUID bits without needing to
opt-in on a feature-by-feature basis, including for features that are fully
controlled by userspace.

It's definitely doable, but I'm not all that confident that the end result would
be a net positive, e.g. I believe we would need to special case things like the
feature bits that gate MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD. MOVBE and RDPID
are other features that come to mind, where KVM emulates the feature in software
but it won't be set in kvm_cpu_caps.

Oof, and MONITOR and MWAIT too, as KVM deliberately doesn't advertise those to
userspace.

So yeah, I'm not opposed to trying that route at some point, but I really don't
want to do that in this series as the risk of subtly breaking something is super
high.

> Special code will only be needed in few more complex cases, like forced exposed
> of a feature to a guest due to a virtualization hole.
>
>
> > + else
> > + vcpu->arch.cpu_caps[i] = 0;
> > + }
> >
> > /*
> > * If TDP is enabled, let the guest use GBPAGES if they're supported in
> > @@ -342,8 +380,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > */
> > allow_gbpages = tdp_enabled ? boot_cpu_has(X86_FEATURE_GBPAGES) :
> > guest_cpuid_has(vcpu, X86_FEATURE_GBPAGES);
> > - if (allow_gbpages)
> > - guest_cpu_cap_set(vcpu, X86_FEATURE_GBPAGES);
> > + guest_cpu_cap_change(vcpu, X86_FEATURE_GBPAGES, allow_gbpages);
>
> IMHO the original code was more readable, now I need to look up the
> 'guest_cpu_cap_change()' to understand what is going on.

The change is "necessary". The issue is that with the caps 0-initialied, the
!allow_gbpages could simply do nothing. Now, KVM needs to explicitly clear the
flag, i.e. would need to do:

if (allow_gbpages)
guest_cpu_cap_set(vcpu, X86_FEATURE_GBPAGES);
else
guest_cpu_cap_clear(vcpu, X86_FEATURE_GBPAGES);

I don't much love the name either, but it pairs with cpuid_entry_change() and I
want to keep the kvm_cpu_cap, cpuid_entry, and guest_cpu_cap APIs in sync as far
as the APIs go. The only reason kvm_cpu_cap_change() doesn't exist is because
there aren't any flows that need to toggle a bit.

> > static __always_inline bool guest_cpu_cap_has(struct kvm_vcpu *vcpu,
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 8a99a73b6ee5..5827328e30f1 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -4315,14 +4315,14 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > * XSS on VM-Enter/VM-Exit. Failure to do so would effectively give
> > * the guest read/write access to the host's XSS.
> > */
> > - if (boot_cpu_has(X86_FEATURE_XSAVE) &&
> > - boot_cpu_has(X86_FEATURE_XSAVES) &&
> > - guest_cpuid_has(vcpu, X86_FEATURE_XSAVE))
> > - guest_cpu_cap_set(vcpu, X86_FEATURE_XSAVES);
> > + guest_cpu_cap_change(vcpu, X86_FEATURE_XSAVES,
> > + boot_cpu_has(X86_FEATURE_XSAVE) &&
> > + boot_cpu_has(X86_FEATURE_XSAVES) &&
> > + guest_cpuid_has(vcpu, X86_FEATURE_XSAVE));
>
> In theory this change does change behavior, now the X86_FEATURE_XSAVE will
> be set iff the condition is true, but before it was set *if* the condition was true.

No, before it was set if and only if the condition was true, because in that case
caps were 0-initialized, i.e. this was/is the only way for XSAVE to be set.

> > - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_NRIPS);
> > - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_TSCRATEMSR);
> > - guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_LBRV);
> > + guest_cpu_cap_restrict(vcpu, X86_FEATURE_NRIPS);
> > + guest_cpu_cap_restrict(vcpu, X86_FEATURE_TSCRATEMSR);
> > + guest_cpu_cap_restrict(vcpu, X86_FEATURE_LBRV);
>
> One of the main reasons I don't like governed features is this manual list.

To be fair, the manual lists predate the governed features.

> I want to reach the point that one won't need to add anything manually,
> unless there is a good reason to do so, and there are only a few exceptions
> when the guest cap is set, while the host's isn't.

Yeah, agreed.