Re: [PATCH RFC v9 04/51] KVM: x86: Determine shared/private faults using a configurable mask

From: Michael Roth
Date: Wed Jun 21 2023 - 19:01:02 EST


On Tue, Jun 20, 2023 at 02:18:45PM -0700, Isaku Yamahata wrote:
> On Tue, Jun 20, 2023 at 03:28:41PM -0500,
> Michael Roth <michael.roth@xxxxxxx> wrote:
>
> > On Wed, Jun 14, 2023 at 09:47:09AM -0700, Isaku Yamahata wrote:
> > > On Sun, Jun 11, 2023 at 11:25:12PM -0500,
> > > Michael Roth <michael.roth@xxxxxxx> wrote:
> > >
> > > > This will be used to determine whether or not an #NPF should be serviced
> > > > using a normal page vs. a guarded/gmem one.
> > > >
> > > > Signed-off-by: Michael Roth <michael.roth@xxxxxxx>
> > > > ---
> > > > arch/x86/include/asm/kvm_host.h | 7 +++++++
> > > > arch/x86/kvm/mmu/mmu_internal.h | 35 ++++++++++++++++++++++++++++++++-
> > > > 2 files changed, 41 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index b3bd24f2a390..c26f76641121 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1445,6 +1445,13 @@ struct kvm_arch {
> > > > */
> > > > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> > > > struct kvm_mmu_memory_cache split_desc_cache;
> > > > +
> > > > + /*
> > > > + * When set, used to determine whether a fault should be treated as
> > > > + * private in the context of protected VMs which use a separate gmem
> > > > + * pool to back private guest pages.
> > > > + */
> > > > + u64 mmu_private_fault_mask;
> > > > };
> > > >
> > > > struct kvm_vm_stat {
> > > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > > index 780b91e1da9f..9b9e75aa43f4 100644
> > > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > > @@ -252,6 +252,39 @@ struct kvm_page_fault {
> > > >
> > > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> > > >
> > > > +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
> > > > +{
> > > > + struct kvm_memory_slot *slot;
> > > > + bool private_fault = false;
> > > > + gfn_t gfn = gpa_to_gfn(gpa);
> > > > +
> > > > + slot = gfn_to_memslot(kvm, gfn);
> > > > + if (!slot) {
> > > > + pr_debug("%s: no slot, GFN: 0x%llx\n", __func__, gfn);
> > > > + goto out;
> > > > + }
> > > > +
> > > > + if (!kvm_slot_can_be_private(slot)) {
> > > > + pr_debug("%s: slot is not private, GFN: 0x%llx\n", __func__, gfn);
> > > > + goto out;
> > > > + }
> > > > +
> > > > + if (kvm->arch.mmu_private_fault_mask) {
> > > > + private_fault = !!(err & kvm->arch.mmu_private_fault_mask);
> > > > + goto out;
> > > > + }
> > >
> > > What's the convention of err? Can we abstract it by introducing a new bit
> > > PFERR_PRIVATE_MASK? The caller sets it based on arch specific value.
> > > the logic will be
> > > .is_private = err & PFERR_PRIVATE_MASK;
> >
> > I'm not sure I understand the question. 'err' is just the page fault flags,
> > and arch.mmu_private_fault_mask is something that can be set on a
> > per-platform basis when running in a mode where shared/private access
> > is recorded in the page fault flags during a #NPF.
> >
> > I'm not sure how we'd keep the handling cross-platform by moving to a macro,
> > since TDX uses a different bit, and we'd want to be able to build a
> > SNP+TDX kernel that could run on either type of hardware.
> >
> > Are you suggesting to reverse that and have err be set in a platform-specific
> > way and then use a common PFERR_PRIVATE_MASK that's software-defined and
> > consistent across platforms? That could work, but existing handling seems
> > to use page fault flags as-is, keeping the hardware-set values, rather than
> > modifying them to pass additional metadata, so it seems like it might
> > make things more confusing to make an exception to that here. Or are
> > there other cases where it's done that way?
>
> I meant the latter, making PFERR_PRIVATE_MASK common software-defined.
>
> I think the SVM fault handler can use hardware value directly by carefully
> defining those PFERR values.
>
> TDX doesn't have architectural bit in error code to indicate the private fault.
> It's coded in faulted address as shared bit. GPA bit 51 or 47.
> PFERR_{USER, WRITE, FETCH, PRESENT} are already software-defined value for VMX
> (and TDX). The fault handler for VMX, handle_ept_violation(), converts
> encoding. For TDX, PFERR_PRIVATE_MASK is just one more software defined bit.
>
> I'm fine with either way, variable or macro. Which do you prefer?
>
> - Define variable mmu_private_fault_mask (global or per struct kvm)
> The module initialization code, hardware_setup(), sets mmu_private_fault_mask.
> - Define the software defined value, PFERR_PRIVATE_MASK.
> The caller of kvm_mmu_page_fault() parses the hardware value and construct
> software defined error_code.
> - any other?
>
>
> > > > +
> > > > + /*
> > > > + * Handling below is for UPM self-tests and guests that treat userspace
> > > > + * as the authority on whether a fault should be private or not.
> > > > + */
> > > > + private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > >
> > > This code path is sad. One extra slot lookup and xarray look up.
> > > Without mmu lock, the result can change by other vcpu.
> > > Let's find a better way.
> >
> > The intention was to rely on fault->mmu_seq to determine if a
> > KVM_SET_MEMORY_ATTRIBUTES update came in after .private_fault was set so
> > that fault handling could be retried, but that doesn't happen until
> > kvm_faultin_pfn() which is *after* this is logged. So yes, I think there
> > is a race here, and the approach you took in your Misc. series of
> > keeping the kvm_mem_is_private() check inside kvm_faultin_pfn() is more
> > efficient/correct.
> >
> > If we can figure out a way to handle checking the fault flags in a way
> > that works for both TDX/SNP (and KVM self-test use-case) we can
> > consolidate around that.
>
> I can think of the following ways. I think the second option is better because
> we don't need exit bit for error code.
>
> - Introduce software defined error code
> - Add a flags to struct kvm_arch for self-test use-case VM_TYPE_PROTECTED_VM.
> Set it to true for VM_TYPE_PROTECTED_VM case.
> - any other?

Vishal: hoping to get your thoughts here as well from the perspective of
the KVM self-test use-case.

I was thinking that once we set fault->is_private, that sort of
becomes our "software-defined" bit, and what KVM would use from that
point forward to determine whether or not the access should be treated
as a private one or not, and that whatever handler sets
fault->is_private would encapsulate away all the platform-specific
bit-checking needed to do that.

So if we were to straight-forwardly implement that based on how TDX
currently handles checking for the shared bit in GPA, paired with how
SEV-SNP handles checking for private bit in fault flags, it would look
something like:

bool kvm_fault_is_private(kvm, gpa, err)
{
/* SEV-SNP handling */
if (kvm->arch.mmu_private_fault_mask)
return !!(err & arch.mmu_private_fault_mask);

/* TDX handling */
if (kvm->arch.gfn_shared_mask)
return !!(gpa & arch.gfn_shared_mask);

return false;
}

kvm_mmu_do_page_fault(vcpu, gpa, err, ...)
{
struct kvm_page_fault fault = {
...
.is_private = kvm_fault_is_private(vcpu->kvm, gpa, err)
};

...
}

And then arch.mmu_private_fault_mask and arch.gfn_shared_mask would be
set per-KVM-instance, just like they are now with current SNP and TDX
patchsets, since stuff like KVM self-test wouldn't be setting those
masks, so it makes sense to do it per-instance in that regard.

But that still gets a little awkward for the KVM self-test use-case where
.is_private should sort of be ignored in favor of whatever the xarray
reports via kvm_mem_is_private(). In your Misc. series I believe you
handled this by introducing a PFERR_HASATTR_MASK bit so we can determine
whether existing value of fault->is_private should be
ignored/overwritten or not.

So maybe kvm_fault_is_private() needs to return an integer value
instead, like:

enum {
KVM_FAULT_VMM_DEFINED,
KVM_FAULT_SHARED,
KVM_FAULT_PRIVATE,
}

bool kvm_fault_is_private(kvm, gpa, err)
{
/* SEV-SNP handling */
if (kvm->arch.mmu_private_fault_mask)
(err & arch.mmu_private_fault_mask) ? KVM_FAULT_PRIVATE : KVM_FAULT_SHARED

/* TDX handling */
if (kvm->arch.gfn_shared_mask)
(gpa & arch.gfn_shared_mask) ? KVM_FAULT_SHARED : KVM_FAULT_PRIVATE

return KVM_FAULT_VMM_DEFINED;
}

And then down in __kvm_faultin_pfn() we do:

if (fault->is_private == KVM_FAULT_VMM_DEFINED)
fault->is_private = kvm_mem_is_private(vcpu->kvm, fault->gfn);
else if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
return kvm_do_memory_fault_exit(vcpu, fault);

if (fault->is_private)
return kvm_faultin_pfn_private(vcpu, fault);

Maybe kvm_fault_is_private() can be simplified based on what direction
we end up taking WRT ongoing discussions like whether we decide to define
KVM_X86_{SNP,TDX}_VM vm_types in addition to the KVM_X86_PROTECTED_VM
type that the selftests uses, but hoping that for this path, any changes
along that line can be encapsulated away in kvm_fault_is_private() without
any/much further churn at the various call-sites like __kvm_faultin_pfn().

We could even push all the above logic down into the KVM self-tests, but
have:

bool kvm_fault_is_private(kvm, gpa, err) {
return KVM_FAULT_VMM_DEFINED;
}

And that would be enough to run self-tests as standalone series, with
TDX/SNP should filling in kvm_fault_is_private() with their
platform-specific handling.

Does that seem reasonable to you? At least as a starting point.

-Mike

>
> Thanks,
> --
> Isaku Yamahata <isaku.yamahata@xxxxxxxxx>