Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

From: Sean Christopherson
Date: Fri Nov 03 2023 - 10:33:08 EST


On Thu, Nov 02, 2023, Maxim Levitsky wrote:
> On Wed, 2023-11-01 at 07:16 -0700, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> > > > --
> > > > From: Sean Christopherson <seanjc@xxxxxxxxxx>
> > > > Date: Thu, 26 Oct 2023 10:17:33 -0700
> > > > Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> > > > __state_perm
> > > >
> > > > Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> > > > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> > > > ---
> > > > arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> > > > 1 file changed, 11 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > > > index ef6906107c54..73f6bc00d178 100644
> > > > --- a/arch/x86/kernel/fpu/xstate.c
> > > > +++ b/arch/x86/kernel/fpu/xstate.c
> > > > @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> > > > if ((permitted & requested) == requested)
> > > > return 0;
> > > >
> > > > - /* Calculate the resulting kernel state size */
> > > > + /*
> > > > + * Calculate the resulting kernel state size. Note, @permitted also
> > > > + * contains supervisor xfeatures even though supervisor xfeatures are always
> > > > + * permitted for kernel and guest FPUs, and never permitted for user
> > > > + * FPUs.
> > > > + */
> > > > mask = permitted | requested;
> > > > - /* Take supervisor states into account on the host */
> > > > - if (!guest)
> > > > - mask |= xfeatures_mask_supervisor();
> > > > ksize = xstate_calculate_size(mask, compacted);
> > >
> > > This might not work with kernel dynamic features, because
> > > xfeatures_mask_supervisor() will return all supported supervisor features.
> >
> > I don't understand what you mean by "This".
>
> >
> > Somewhat of a side topic, I feel very strongly that we should use "guest only"
> > terminology instead of "dynamic". There is nothing dynamic about whether or not
> > XFEATURE_CET_KERNEL is allowed; there's not even a real "decision" beyond checking
> > whether or not CET is supported.
>
> > > Therefore at least until we have an actual kernel dynamic feature (a feature
> > > used by the host kernel and not KVM, and which has to be dynamic like AMX),
> > > I suggest that KVM stops using the permission API completely for the guest
> > > FPU state, and just gives all the features it wants to enable right to
> >
> > By "it", I assume you mean userspace?
> >
> > > __fpu_alloc_init_guest_fpstate() (Guest FPU permission API IMHO should be
> > > deprecated and ignored)
> >
> > KVM allocates guest FPU state during KVM_CREATE_VCPU, so not using prctl() would
> > either require KVM to defer allocating guest FPU state until KVM_SET_CPUID{,2},
> > or would require a VM-scoped KVM ioctl() to let userspace opt-in to
> >
> > Allocating guest FPU state during KVM_SET_CPUID{,2} would get messy, as KVM allows
> > multiple calls to KVM_SET_CPUID{,2} so long as the vCPU hasn't done KVM_RUN. E.g.
> > KVM would need to support actually resizing guest FPU state, which would be extra
> > complexity without any meaningful benefit.
>
>
> OK, I understand you now. What you claim is that it is legal to do this:
>
> - KVM_SET_XSAVE
> - KVM_SET_CPUID (with AMX enabled)
>
> KVM_SET_CPUID will have to resize the xstate which is already valid.

I was actually talking about

KVM_SET_CPUID2 (with dynamic user feature #1)
KVM_SET_CPUID2 (with dynamic user feature #2)

The second call through __xstate_request_perm() will be done with only user
xfeatures in @permitted and so the kernel will compute the wrong ksize.
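
A toy userspace model of that sequence, purely as illustration. Nothing here is
kernel code: the TOY_* bits and both helper functions are hypothetical stand-ins,
and the returned mask merely represents what would be handed to the kernel-size
calculation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout for illustration; NOT the real XFEATURE_* numbers. */
#define TOY_USER_DEFAULT	(UINT64_C(1) << 0)	/* always-enabled user state */
#define TOY_SUPERVISOR		(UINT64_C(1) << 1)	/* supervisor state */
#define TOY_DYN_USER_1		(UINT64_C(1) << 2)	/* dynamic user feature #1 */
#define TOY_DYN_USER_2		(UINT64_C(1) << 3)	/* dynamic user feature #2 */

/*
 * Models the problematic behavior: the mask used for the kernel size is
 * derived from the stored permissions, but only user bits are stored back.
 * Returns the mask that would be used to size the kernel buffer.
 */
static uint64_t toy_request_perm_user_only(uint64_t *state_perm,
					   uint64_t requested)
{
	uint64_t mask = *state_perm | requested;

	/* Supervisor bits are lost for every later request. */
	*state_perm = mask & ~TOY_SUPERVISOR;
	return mask;
}

/* Models the proposed behavior: non-user bits survive in the stored mask. */
static uint64_t toy_request_perm_preserving(uint64_t *state_perm,
					    uint64_t requested)
{
	uint64_t mask = *state_perm | requested;

	*state_perm = mask;
	return mask;
}

/* Walks both variants through two consecutive requests. */
static void toy_demo(void)
{
	uint64_t lossy = TOY_USER_DEFAULT | TOY_SUPERVISOR;
	uint64_t kept = TOY_USER_DEFAULT | TOY_SUPERVISOR;

	/* First request: both variants still see the supervisor bit. */
	assert(toy_request_perm_user_only(&lossy, TOY_DYN_USER_1) & TOY_SUPERVISOR);
	assert(toy_request_perm_preserving(&kept, TOY_DYN_USER_1) & TOY_SUPERVISOR);

	/* Second request: only the preserving variant still sizes for it. */
	assert(!(toy_request_perm_user_only(&lossy, TOY_DYN_USER_2) & TOY_SUPERVISOR));
	assert(toy_request_perm_preserving(&kept, TOY_DYN_USER_2) & TOY_SUPERVISOR);
}
```

Calling toy_demo() exercises both variants; in the user-only variant the second
request yields a mask without the supervisor bit, which is the "wrong ksize" case.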

> Your patch to fix __xstate_request_perm() does seem to be correct in the
> sense that it will preserve the kernel fpu components in the fpu permissions.
>
> However note that kernel fpu permissions come from
> 'fpu_kernel_cfg.default_features' which don't include the dynamic kernel
> xfeatures (added a few patches before this one).

CET_KERNEL isn't dynamic! It's guest-only. There are no runtime decisions as to
whether or not CET_KERNEL is allowed. All guest FPUs get CET_KERNEL, no kernel FPUs
get CET_KERNEL.

That matters because I am also proposing that we add a dedicated, defined-at-boot
fpu_guest_cfg instead of bolting on a "dynamic", which is what I meant by this:

: Or even better if it doesn't cause weirdness elsewhere, a dedicated
: fpu_guest_cfg. For me at least, a fpu_guest_cfg would make it easier to
: understand what all is going on.

That way, initialization of permissions is simply

fpu->guest_perm = fpu_guest_cfg.default_features;

and there's no need to differentiate between guest and kernel FPUs when reallocating
for dynamic user xfeatures because guest_perm.__state_perm already holds the correct
data.
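
A rough userspace sketch of that idea, under stated assumptions: the demo_*
structs, the DEMO_* bits, and the helpers are all hypothetical and only mimic the
shape of a boot-time guest config next to the kernel config. It is not the
proposed kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bits for illustration; NOT the real XFEATURE_* numbers. */
#define DEMO_USER_DEFAULT	(UINT64_C(1) << 0)	/* always-enabled user state */
#define DEMO_GUEST_ONLY		(UINT64_C(1) << 1)	/* stands in for a guest-only xfeature */
#define DEMO_DYN_USER		(UINT64_C(1) << 2)	/* stands in for a dynamic user xfeature */

struct demo_fpu_cfg {
	uint64_t default_features;
};

struct demo_perm {
	uint64_t state_perm;
};

/* Both configs are fixed up front; nothing is decided at runtime. */
static const struct demo_fpu_cfg demo_kernel_cfg = {
	.default_features = DEMO_USER_DEFAULT,
};
static const struct demo_fpu_cfg demo_guest_cfg = {
	.default_features = DEMO_USER_DEFAULT | DEMO_GUEST_ONLY,
};

/* Each permission set is seeded from its own config. */
static void demo_init_perms(struct demo_perm *kernel, struct demo_perm *guest)
{
	kernel->state_perm = demo_kernel_cfg.default_features;
	guest->state_perm = demo_guest_cfg.default_features;
}

/*
 * Granting a dynamic user feature needs no guest/kernel special case: the
 * stored permissions already hold the right baseline for each FPU type.
 */
static uint64_t demo_grant(struct demo_perm *perm, uint64_t requested)
{
	perm->state_perm |= requested;
	return perm->state_perm;
}
```

With this shape, a later demo_grant() on the guest permissions keeps the
guest-only bit without consulting any extra "is this a guest?" flag.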

> Therefore an attempt to resize the xstate to include a kernel dynamic feature by
> __xfd_enable_feature will fail.
>
> If kvm on the other hand includes all the kernel dynamic features in the
> initial allocation of FPU state (not optimal but possible),

This is what I am suggesting.

: There are definitely scenarios where CET will not be exposed to KVM guests, but
: I don't see any reason to make the guest FPU space dynamically sized for CET.
: It's what, 40 bytes?

> then a later call to __xstate_request_perm for a userspace dynamic feature
> (which can still happen) will mess up the xstate, because again the
> permission code assumes that only default kernel features were granted the
> permissions.
>
>
> This has to be solved one way or another.
>
> >
> > The only benefit I can think of for a VM-scoped ioctl() is that it would allow a
> > single process to host multiple VMs with different dynamic xfeature requirements.
> > But such a setup is mostly theoretical. Maybe it'll affect the SEV migration
> > helper at some point? But even that isn't guaranteed.
> >
> > So while I agree that ARCH_GET_XCOMP_GUEST_PERM isn't ideal, practically speaking
> > it's sufficient for all current use cases. Unless a concrete use case comes along,
> > deprecating ARCH_GET_XCOMP_GUEST_PERM in favor of a KVM ioctl() would be churn for
> > both the kernel and userspace without any meaningful benefit, or really even any
> > true change in behavior.
>
>
> ARCH_GET_XCOMP_GUEST_PERM/ARCH_SET_XCOMP_GUEST_PERM is not a good API from
> usability POV, because it is redundant.
>
> KVM already has an API, KVM_SET_CPUID2, by which qemu/userspace tells KVM
> how much space to allocate to support a VM with *this* CPUID.
>
> For example if qemu asks for nested SVM/VMX, then kvm will allocate on demand
> state for it (also at least 8K/vCPU btw). The same should apply for AMX -
> Qemu sets AMX xsave bit in CPUID - that permits KVM to allocate the extra
> state when needed.
>
> I don't see why we need an extra, non-KVM API for that.

I don't necessarily disagree, but what's done is done. We missed our chance to
propose a different mechanism, and at this point undoing all of that without good
cause is unlikely to benefit anyone. If a use case comes along that needs something
"better" than the prctl() API, then I agree it'd be worth revisiting.