Re: [RFC PATCH v4 07/10] KVM: x86: Add gmem hook for initializing private memory

From: Michael Roth
Date: Tue Aug 29 2023 - 09:28:44 EST


On Fri, Aug 25, 2023 at 07:59:41PM -0500, Michael Roth wrote:
> On Fri, Aug 18, 2023 at 03:27:47PM -0700, Sean Christopherson wrote:
> > This seems like a bug in the SNP code. (a) why would KVM/SNP PSMASH in that
> > scenario and (b) why can't it zap/split the NPT before PSMASH?
>
> a) A PSMASH will be needed at some point since, as detailed above, a 4K
> NPT mapping requires that the RMP entries for the pages it maps also be
> 4K entries, but...
> b) What would happen normally[1] is the guest would issue PVALIDATE to
> *rescind* the validated status of that 4K GPA before issuing the GHCB
> request to convert it to shared. This would cause an
> #NPF(RMP+SIZE_MISMATCH) and handle_rmp_page_fault() would PSMASH the RMP
> entry so the PVALIDATE can succeed.
>
> So KVM doesn't even really have the option of deferring the PSMASH, it
> happens before the SET_MEMORY_ATTRIBUTES is even issued to zap the 2MB
> NPT mapping and switch the 2M range to 'mixed'. Because of that, we also
> need a hook in the KVM MMU code to clamp the max mapping level based
> on RMP entry size. Currently the kvm_gmem_prepare() hook in this patch
> does double duty to handle that clamping, so we would still need a
> similar hook for that if we move the RMP initialization stuff to
> allocation time.
>
> [1] This handling is recommended for 'well-behaved' guests according to
> GHCB, but I don't see it documented as a hard requirement anywhere, so
> there is a possibility that we have to deal with a guest that doesn't do
> this.
> What would happen then is the PVALIDATE wouldn't trigger the #NPF(RMP+SIZEM),
> and instead the SET_MEMORY_ATTRIBUTES would zap the 2MB mapping, install
> 4K entries on next #NPF(NOT_PRESENT), and at *that* point we would get
> an #NPF(RMP) without the SIZEM bit set, due to the behavior described in
> the beginning of this email.
>
> handle_rmp_page_fault() can do the corresponding PSMASH to deal with that,
> but it is a little unfortunate since we can't differentiate that case from
> spurious/unexpected RMP faults, so we would need to attempt a PSMASH in
> all cases, sometimes failing.

The spurious case here is when the guest accesses a private page that
has just been PSMASH'd by another thread. I initially thought these
faults might still be taken before the PSMASH had completed, in which
case we could still see the page-size bit set in the RMP entry; but the
RMP faults only surface after the PSMASH has finished, so the spurious
cases can be filtered out by simply checking whether the page-size bit
is set before attempting a PSMASH.
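As a rough sketch of that filtering (all names here are hypothetical and
illustrative, not the actual kernel structures or symbols):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical, simplified model of an RMP entry with just the fields
 * this check needs; real RMP entries are hardware-defined and carry
 * more state.
 */
struct rmp_entry {
    bool assigned;     /* page is assigned (private) */
    bool page_size_2m; /* true = 2MB RMP entry, false = 4K */
};

/*
 * The filter described above: a #NPF(RMP) that finds the entry already
 * split to 4K must be a spurious fault (another thread finished its
 * PSMASH first), so only a still-2MB private entry warrants a PSMASH.
 */
static bool rmp_fault_needs_psmash(const struct rmp_entry *e)
{
    return e->assigned && e->page_size_2m;
}
```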

>
> gmem itself could also trigger this case if the lpage_info_slot() tracking
> ever became more granular than what the guest expected (which I don't
> think would happen normally, but I may have hit one case where it does;
> I haven't had a chance to debug whether that's on the lpage_info_slot()
> side or something else on the SNP end).

Turns out that was a case where the shared/private attributes for the
2MB range really were mixed at the time of access: OVMF converts some
shared memory to private during early boot, and since the end address of
the conversion is not 2MB-aligned, it falls in a mixed region where the
NPT mapping is 4K but the RMP entry was initialized as 2MB. Here the
PVALIDATE and NPT agree on the 4K mapping size, so the SIZEM bit isn't
set, just the #NPF(RMP).
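The arithmetic behind that mixed region can be sketched as follows
(illustrative names, not real KVM symbols): a 2MB region ends up mixed
when a conversion covers part of it but not all of it, e.g. when the
conversion's end address is not 2MB-aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M (2ULL * 1024 * 1024)

/*
 * Illustrative sketch: a 2MB-aligned region is "mixed" (partly private,
 * partly shared) when a shared->private conversion covers some of the
 * region but not all of it. Mixed regions can only be mapped with 4K
 * NPT entries.
 */
static bool region_is_mixed(uint64_t region_base, uint64_t conv_start,
                            uint64_t conv_end)
{
    uint64_t region_end = region_base + SZ_2M;
    bool covers_some = conv_end > region_base && conv_start < region_end;
    bool covers_all  = conv_start <= region_base && conv_end >= region_end;

    return covers_some && !covers_all;
}
```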

So we need to be able to deal with that even for 'well-behaved' guests.
With the RMP-init-during-mapping-time approach I had checks that avoided
creating the 2MB RMP entry in this mixed case, which is why I didn't
need handling for this previously. But it's just one extra #NPF(RMP),
and it can be handled cleanly since it can be distinguished from the
spurious cases.
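Putting the cases together, a hedged sketch of the resulting fault
classification (the flag names are illustrative, loosely modeled on the
SNP #NPF error-code bits, not the actual kernel definitions):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative fault-code bits, loosely modeled on SNP #NPF error codes. */
#define NPF_RMP   (1u << 0) /* RMP violation */
#define NPF_SIZEM (1u << 1) /* guest/RMP page-size mismatch (PVALIDATE) */

/*
 * Both the well-behaved path (#NPF(RMP+SIZEM) from a 4K PVALIDATE
 * rescind against a 2MB RMP entry) and the mixed-region path
 * (#NPF(RMP) alone, 4K NPT over a 2MB RMP entry) end in a PSMASH;
 * a fault that finds the RMP entry already at 4K granularity is
 * spurious and needs no PSMASH.
 */
static bool should_psmash(unsigned int fault_flags, bool rmp_entry_is_2mb)
{
    if (!(fault_flags & NPF_RMP))
        return false;           /* not an RMP violation at all */
    return rmp_entry_is_2mb;    /* already-4K entry => spurious fault */
}
```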

-Mike