RE: [PATCH 3/5] x86/mm: Mark CoCo VM pages not present while changing encrypted state

From: Michael Kelley (LINUX)
Date: Mon Oct 16 2023 - 20:35:37 EST


From: Michael Kelley (LINUX) <mikelley@xxxxxxxxxxxxx> Sent: Monday, October 2, 2023 1:43 PM
>
> From: Tom Lendacky <thomas.lendacky@xxxxxxx> Sent: Monday, October 2, 2023
> 11:59 AM
> >
> > On 10/2/23 11:35, Tom Lendacky wrote:
> > > On 9/29/23 13:19, Michael Kelley wrote:
> > >> In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> > >> versa, attributes in the PTE must be updated *and* the hypervisor must
> > >> be notified of the change. Because there are two separate steps, there's
> > >> a window where the settings are inconsistent.  Normally the code that
> > >> initiates the transition (via set_memory_decrypted() or
> > >> set_memory_encrypted()) ensures that the memory is not being accessed
> > >> during a transition, so the window of inconsistency is not a problem.
> > >> However, the load_unaligned_zeropad() function can read arbitrary memory
> > >> pages at arbitrary times, which could access a transitioning page during
> > >> the window.  In such a case, CoCo VM specific exceptions are taken
> > >> (depending on the CoCo architecture in use).  Current code in those
> > >> exception handlers recovers and does "fixup" on the result returned by
> > >> load_unaligned_zeropad().  Unfortunately, this exception handling can't
> > >> work in paravisor scenarios (TDX Paritioning and SEV-SNP in vTOM mode).
> > >> The exceptions need to be forwarded from the paravisor to the Linux
> > >> guest, but there are no architectural specs for how to do that.
> > >>
> > >> Fortunately, there's a simpler way to solve the problem by changing
> > >> the core transition code in __set_memory_enc_pgtable() to do the
> > >> following:
> > >>
> > >> 1.  Remove aliasing mappings
> > >> 2.  Flush the data cache if needed
> > >> 3.  Remove the PRESENT bit from the PTEs of all transitioning pages
> > >> 4.  Set/clear the encryption attribute as appropriate
> > >> 5.  Flush the TLB so the changed encryption attribute isn't visible
> > >> 6.  Notify the hypervisor of the encryption status change
> > >
> > > Not sure why I didn't notice this before, but I will need to test this to
> > > be certain. As part of this notification, the SNP support will issue a
> > > PVALIDATE instruction (to either validate or rescind validation to the
> > > page). PVALIDATE takes a virtual address. If the PRESENT bit has been
> > > removed, the PVALIDATE instruction will take a #PF (see comments below).
> >
> > Yes, this series results in a #PF booting an SNP guest:
>
> Bummer :-( Interestingly, an SNP guest on Hyper-V with a paravisor
> works, presumably because the paravisor is doing the PVALIDATE with
> a different guest virtual address for the physical page. TDX operates
> on physical addresses, and I've tested that it works.
>
> The spec for PVALIDATE says it performs the same segmentation
> and paging checks as a 1-byte read, so indeed, the #PF is expected.
> But in the spec, the pseudo-code for PVALIDATE uses the GUEST_VA
> only to derive the GUEST_PA and the SYSTEM_PA. The GUEST_VA isn't
> otherwise relevant, so any GUEST_VA that validly maps to the GUEST_PA
> would work, as long as we can be assured that load_unaligned_zeropad()
> won't touch that GUEST_VA.
>
> Let me think about if there's a not-too-hacky way to make this work
> with some temporary GVA.
>

It seems like there are two possible approaches to make this work:

1. Create a temporary virtual mapping in vmalloc space and pass
that virtual address to PVALIDATE. (But only do this when PVALIDATE
is being used for private <-> shared transitions, and not for memory
acceptance.) The temporary mapping is updated with each invocation
of PVALIDATE. To make this work, the temp virtual addr must be aligned
on a 2 Meg boundary and must have a guard page preceding it so that
load_unaligned_zeropad() can't stumble into the temporary mapping.
I've wrestled with a few approaches to coding this over the past two
weeks, and have something that's not too bad. This approach
certainly takes some additional CPU cycles. I've tested doing the
temp mapping in the context of a Hyper-V vTOM VM, but don't see
any measurable impact on boot time, even when converting a 1 Gbyte
swiotlb space from private to shared. I'm setting up now to test
in a regular SNP VM where snp_enc_status_change_finish() is used,
to have end-to-end confirmation that it really does work.

2. A completely different approach is for __set_memory_enc_pgtable()
to clear and restore the PRESENT bit only when REFLECT_VC is set in
the MSR_AMD64_SEV, and the equivalent on TDX. This is the case that's
problematic for doing the load_unaligned_zeropad() fix up in the SNP
#VC or TDX #VE exception handler. I think you said some additional work
was needed for the fixup to be done properly in the SNP #VC case, so that
would have to be done. The cleared PRESENT bit is handled by the
paravisor because the paravisor already has a virtual mapping to pass
to PVALIDATE.

Any thoughts? I'll try to get the code for #1 posted in the next few
days, so you can judge the level of additional complexity to manage
the temp virtual mapping.

Michael