Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared

From: kirill.shutemov@xxxxxxxxxxxxxxx
Date: Mon Aug 28 2023 - 12:13:52 EST


On Mon, Aug 28, 2023 at 02:22:11PM +0000, Michael Kelley (LINUX) wrote:
> From: Michael Kelley (LINUX) <mikelley@xxxxxxxxxxxxx> Sent: Tuesday, August 15, 2023 7:54 PM
> >
> > From: kirill.shutemov@xxxxxxxxxxxxxxx <kirill.shutemov@xxxxxxxxxxxxxxx> Sent: Sunday,
> > August 6, 2023 3:20 PM
> > >
> > > On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> > > > In a CoCo VM when a page transitions from private to shared, or vice
> > > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > > be notified of the change. Because there are two separate steps, there's
> > > > a window where the settings are inconsistent. Normally the code that
> > > > initiates the transition (via set_memory_decrypted() or
> > > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > > during a transition, so the window of inconsistency is not a problem.
> > > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > > pages at arbitrary times, which could access a transitioning page during
> > > > the window. In such a case, CoCo VM specific exceptions are taken
> > > > (depending on the CoCo architecture in use). Current code in those
> > > > exception handlers recovers and does "fixup" on the result returned by
> > > > load_unaligned_zeropad(). Unfortunately, this exception handling and
> > > > fixup code is tricky and somewhat fragile. At the moment, it is
> > > > broken for both TDX and SEV-SNP.
> > >
> >
> > Thanks for looking at this. I'm finally catching up after being out on
> > vacation for a week.
> >
> > > I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?
> >
> > I presume you meant "now fixed for TDX", which I agree with. Tom
> > Lendacky has indicated that there's still a problem with SEV-SNP. He
> > could fix that problem, but this approach of marking the pages
> > invalid obviates the need for Tom's fix.
> >
> > >
> > > > There's also a problem with the current code in paravisor scenarios:
> > > > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > > > to be forwarded from the paravisor to the Linux guest, but there
> > > > are no architectural specs for how to do that.
> >
> > The TD Partitioning case (and the similar SEV-SNP vTOM mode case) is
> > what doesn't have a solution. To elaborate, with TD Partitioning, #VE
> > is sent to the containing VM, not the main Linux guest VM. For
> > everything except an EPT violation, the containing VM can handle
> > the exception on behalf of the main Linux guest by doing the
> > appropriate emulation. But for an EPT violation, the containing
> > VM can only terminate the guest. It doesn't have sufficient context
> > to handle a "valid" #VE with EPT violation generated due to
> > load_unaligned_zeropad(). My proposed scheme of marking the
> > pages invalid avoids generating those #VEs and lets TD Partitioning
> > (and similarly for SEV-SNP vTOM) work as intended with a paravisor.
> >
> > > >
> > > > To avoid these complexities of the CoCo exception handlers, change
> > > > the core transition code in __set_memory_enc_pgtable() to do the
> > > > following:
> > > >
> > > > 1. Remove aliasing mappings
> > > > 2. Remove the PRESENT bit from the PTEs of all transitioning pages
> > > > 3. Flush the TLB globally
> > > > 4. Flush the data cache if needed
> > > > 5. Set/clear the encryption attribute as appropriate
> > > > 6. Notify the hypervisor of the page status change
> > > > 7. Add back the PRESENT bit
> > >
> > > Okay, looks safe.
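
To spell out the ordering above in rough C-like pseudo-code (the helper
names below are illustrative only, not necessarily what the patch uses;
only flush_tlb_all() is a real kernel interface):

	static int enc_dec_transition_sketch(unsigned long addr, int numpages,
					     bool enc)
	{
		int ret;

		/* 1: remove aliasing mappings of the range (hypothetical helper) */
		remove_aliases(addr, numpages);

		/* 2: clear _PAGE_PRESENT so any stray access now faults */
		ret = clear_present(addr, numpages);
		if (ret)
			return ret;

		/* 3: no stale translations may survive the transition */
		flush_tlb_all();

		/* 4: flush the data cache if the platform requires it */
		maybe_flush_dcache(addr, numpages);

		/* 5: set or clear the encryption attribute in the PTEs */
		set_enc_attr(addr, numpages, enc);

		/* 6: tell the hypervisor about the private<->shared change */
		notify_hypervisor(addr, numpages, enc);

		/* 7: make the PTEs present again */
		return restore_present(addr, numpages);
	}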
> > >
> > > > With this approach, load_unaligned_zeropad() just takes its normal
> > > > page-fault-based fixup path if it touches a page that is transitioning.
> > > > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > > > are completely decoupled. CoCo VM page transitions can proceed
> > > > without needing to handle architecture-specific exceptions and fix
> > > > things up. This decoupling reduces the complexity due to separate
> > > > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > > > introduce new capabilities in future versions of the TDX and SEV-SNP
> > > > architectures. Paravisor scenarios work properly without needing
> > > > to forward exceptions.
> > > >
> > > > This approach may make __set_memory_enc_pgtable() slightly slower
> > > > because of touching the PTEs three times instead of just once. But
> > > > the run time of this function is already dominated by the hypercall
> > > > and the need to flush the TLB at least once and maybe twice. In any
> > > > case, this function is only used for CoCo VM page transitions, and
> > > > is already unsuitable for hot paths.
> > > >
> > > > The architecture specific callback function for notifying the
> > > > hypervisor typically must translate guest kernel virtual addresses
> > > > into guest physical addresses to pass to the hypervisor. Because
> > > > the PTEs are invalid at the time of callback, the code for doing the
> > > > translation needs updating. virt_to_phys() or equivalent continues
> > > > to work for direct map addresses. But vmalloc addresses cannot use
> > > > vmalloc_to_page() because that function requires the leaf PTE to be
> > > > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > > > manually walk the page table hierarchy, so performance is the same.
> > >
> > > Uhmm.. But why do we expect slow_virt_to_phys() to work on non-present
> > > page table entries? It seems like an accident to me that it works now.
> > > Somebody forgot a pte_present() check.
> >
> > I didn't research the history of slow_virt_to_phys(), but I'll do so.
> >
>
> The history of slow_virt_to_phys() doesn't show any discussion of
> whether it is expected to work for non-present PTEs. However, the
> page table walking is done by lookup_address(), which explicitly
> *does* work for non-present PTEs. For example, lookup_address()
> is used in cpa_flush() to find a PTE. cpa_flush() then checks to see if
> the PTE is present.

Which is a totally fine thing to do. The present bit is the only info you
can always rely on being valid in a non-present PTE.

But that is nitpicking on my side. Here you control the lifecycle of the
PTE, so you know that the PFN in it will stay valid.

I guess the right thing to do is to use lookup_address() and get the PFN
from the PTE, with a comment explaining why it is still valid.
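
Something along these lines (just a sketch, assuming a 4K mapping; a huge
page mapping would need the offset math that slow_virt_to_phys() already
does):

	unsigned int level;
	pte_t *pte = lookup_address(vaddr, &level);
	phys_addr_t phys;

	/*
	 * The PTE is temporarily non-present here, but we own this
	 * transition and never free the page while it is in flight,
	 * so the PFN in the PTE is still valid.
	 */
	phys = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
	phys |= vaddr & ~PAGE_MASK;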

--
Kiryl Shutsemau / Kirill A. Shutemov