RE: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving between private and shared

From: Michael Kelley (LINUX)
Date: Tue Aug 29 2023 - 23:33:51 EST


From: Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> Sent: Tuesday, August 29, 2023 5:03 PM
>
> On Thu, 2023-07-06 at 09:41 -0700, Michael Kelley wrote:
> > In a CoCo VM when a page transitions from private to shared, or vice
> > versa, attributes in the PTE must be updated *and* the hypervisor must
> > be notified of the change. Because there are two separate steps, there's
> > a window where the settings are inconsistent.  Normally the code that
> > initiates the transition (via set_memory_decrypted() or
> > set_memory_encrypted()) ensures that the memory is not being accessed
> > during a transition, so the window of inconsistency is not a problem.
> > However, the load_unaligned_zeropad() function can read arbitrary memory
> > pages at arbitrary times, which could access a transitioning page during
> > the window.  In such a case, CoCo VM specific exceptions are taken
> > (depending on the CoCo architecture in use).  Current code in those
> > exception handlers recovers and does "fixup" on the result returned by
> > load_unaligned_zeropad().  Unfortunately, this exception handling and
> > fixup code is tricky and somewhat fragile.  At the moment, it is
> > broken for both TDX and SEV-SNP.
> >
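As a refresher on the mechanism: from memory, the x86
load_unaligned_zeropad() is roughly the sketch below. The exact
exception-table plumbing has changed across kernel versions, so
don't read this as any one version verbatim.

    static inline unsigned long load_unaligned_zeropad(const void *addr)
    {
            unsigned long ret;

            asm volatile(
                    "1:     mov %[mem], %[ret]\n"
                    "2:\n"
                    _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_ZEROPAD)
                    : [ret] "=r" (ret)
                    : [mem] "m" (*(const unsigned long *)addr));

            return ret;
    }

The single load may cross into the page following addr; if that
page faults, the fixup returns the word with the bytes from the
faulting page zeroed. In a CoCo VM the "fault" instead arrives as
a TDX- or SEV-SNP-specific exception, which is where the fragile
fixup code comes in.
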
> > There's also a problem with the current code in paravisor scenarios:
> > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > to be forwarded from the paravisor to the Linux guest, but there
> > are no architectural specs for how to do that.
> >
> > To avoid these complexities of the CoCo exception handlers, change
> > the core transition code in __set_memory_enc_pgtable() to do the
> > following:
> >
> > 1.  Remove aliasing mappings
> > 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> > 3.  Flush the TLB globally
> > 4.  Flush the data cache if needed
> > 5.  Set/clear the encryption attribute as appropriate
> > 6.  Notify the hypervisor of the page status change
> > 7.  Add back the PRESENT bit
> >
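For reference, the sequence above in rough code form (illustrative
only -- not the actual patch; error handling is elided, and in the
real code set_memory_np() itself triggers the TLB flush rather
than a separate flush_tlb_all() call):

    static int set_memory_enc_dec_sketch(unsigned long addr,
                                         int numpages, bool enc)
    {
            vm_unmap_aliases();               /* 1. remove aliases   */
            set_memory_np(addr, numpages);    /* 2. clear PRESENT    */
            flush_tlb_all();                  /* 3. global TLB flush */
            /* 4. flush the data cache if the platform requires it  */
            /* 5. set or clear the encryption attribute (C bit)     */
            /* 6. notify the hypervisor via the x86_platform.guest
             *    callbacks                                          */
            return set_memory_p(addr, numpages); /* 7. restore PRESENT */
    }
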
> > With this approach, load_unaligned_zeropad() just takes its normal
> > page-fault-based fixup path if it touches a page that is transitioning.
> > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > are completely decoupled.  CoCo VM page transitions can proceed
> > without needing to handle architecture-specific exceptions and fix
> > things up. This decoupling reduces the complexity due to separate
> > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > introduce new capabilities in future versions of the TDX and SEV-SNP
> > architectures. Paravisor scenarios work properly without needing
> > to forward exceptions.
> >
> > This approach may make __set_memory_enc_pgtable() slightly slower
> > because of touching the PTEs three times instead of just once. But
> > the run time of this function is already dominated by the hypercall
> > and the need to flush the TLB at least once and maybe twice. In any
> > case, this function is only used for CoCo VM page transitions, and
> > is already unsuitable for hot paths.
>
> Excluding vm_unmap_aliases(), and just looking at the TLB flushes, it
> kind of looks like this:
> 1. Clear present
> 2. TLB flush
> 3. Set C bit

This step is either "set C bit" or "clear C bit", depending on
whether the transition is shared->private or private->shared.

> 4. Set Present bit
> 5. TLB flush

I was originally thinking that set_memory_p() would be smart
enough to realize that just setting the PRESENT bit doesn't
require flushing the TLB. But looking at the code now, that's
wrong. __change_page_attr() will set the CPA_FLUSHTLB flag
even when the only change is to set the PRESENT bit.
change_page_attr_set_clr() will then act on the CPA_FLUSHTLB
flag and do the flush.
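
Paraphrasing from memory of arch/x86/mm/pat/set_memory.c (not
exact code), the relevant logic is roughly:

    /* __change_page_attr(): any PTE change requests a flush */
    if (pte_val(old_pte) != pte_val(new_pte)) {
            set_pte_atomic(kpte, new_pte);
            cpa->flags |= CPA_FLUSHTLB;
    }

    /* change_page_attr_set_clr(): the flush is skipped only
     * when nothing changed at all */
    if (!(cpa.flags & CPA_FLUSHTLB))
            goto out;
    cpa_flush(&cpa, cache);

So a PRESENT-only change still pays for a full flush.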

>
> But if you instead did:
> 1. Clear Present and set C bit

Interesting idea. As noted above, it could be either
setting or clearing the C bit. Currently there are some things
done before the C bit set/clear, such as flushing the data
cache on TDX. I don't fully understand that requirement, so
I'm not sure of its actual sequencing needs. Hopefully
someone with TDX expertise can clarify. The current
code also allows callbacks pre and post set/clear of the
C bit, though I was considering simplifying to only
having a post callback.
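
For concreteness, the existing hook points look roughly like
this (simplified; in current code both callbacks can fail):

    if (!x86_platform.guest.enc_status_change_prepare(addr,
                                                numpages, enc))
            return -EIO;            /* pre callback  */

    /* ... set or clear the C bit in the PTEs ... */

    if (!x86_platform.guest.enc_status_change_finish(addr,
                                                numpages, enc))
            return -EIO;            /* post callback */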

> 2. TLB flush
> 3. Set Present bit (no flush)

I'll look at whether there's a way to prevent the flush
without creating an alternate version of
change_page_attr_set_clr(). Maybe set_memory_p()
could pass in a new CPA_NO_FLUSHTLB flag that
overrides CPA_FLUSHTLB.
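
Something like the following, purely hypothetical --
CPA_NO_FLUSHTLB doesn't exist today and the name is made up:

    /* in change_page_attr_set_clr(): let a caller that knows
     * its change is PRESENT-only override the flush decision */
    if (!(cpa.flags & CPA_FLUSHTLB) ||
        (cpa.flags & CPA_NO_FLUSHTLB))
            goto out;               /* skip cpa_flush() */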

In the bigger picture, the performance of these
private<->shared transitions is dominated by the
hypercalls to sync things with the hypervisor. The Linux
manipulation of the PTEs is a second-order cost, though
the TLB flushes are the most expensive part of it. It might
be worth some effort to avoid the extra TLB flush.

>
> Then you could still have only 1 TLB flush and 2 operations instead of
> 3. Otherwise it's the same load_unaligned_zeropad() benefits you are
> looking for I think. But I'm not very educated on the private<->shared
> conversion HW rules though, so maybe not.

Eliminating the TLB flush after setting the PRESENT bit seems
worthwhile. I'm less sure about combining the clearing of
PRESENT with the C bit set/clear. We might want to keep
both the pre and post callbacks for future flexibility in how
the hypervisor is invoked. In that case the pre callback
needs to be done between clearing PRESENT and the C
bit set/clear operation, and so they would need to stay
separate. As I said previously, the manipulation of the
PTEs is not a significant cost in the big picture.

>
> >
> > The architecture specific callback function for notifying the
> > hypervisor typically must translate guest kernel virtual addresses
> > into guest physical addresses to pass to the hypervisor.  Because
> > the PTEs are invalid at the time of callback, the code for doing the
> > translation needs updating.  virt_to_phys() or equivalent continues
> > to work for direct map addresses.  But vmalloc addresses cannot use
> > vmalloc_to_page() because that function requires the leaf PTE to be
> > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > manually walk the page table hierarchy, so performance is the same.
>
> Just curious. Are vmalloc addresses supported here? It looks like in
> SEV, but not TDX.

Dexuan Cui has a separate patch to support vmalloc addresses in TDX.
See https://lore.kernel.org/lkml/20230811214826.9609-3-decui@xxxxxxxxxxxxx/
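
For illustration, with the PTEs not present, the translation in
the hypervisor-notification callback then looks roughly like this
(sketch only; "vaddr" is whatever address the callback is given):

    phys_addr_t pa;

    if (is_vmalloc_addr((void *)vaddr))
            /* manual page-table walk; no valid leaf PTE needed */
            pa = slow_virt_to_phys((void *)vaddr);
    else
            /* direct map address: pure arithmetic */
            pa = __pa(vaddr);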

Thanks for looking at this ...

Michael