Re: [RFC V1 0/5] x86: CVMs: Align memory conversions to 2M granularity

From: Vishal Annapurve
Date: Tue Jan 30 2024 - 11:43:21 EST


On Fri, Jan 12, 2024 at 11:22 AM Vishal Annapurve <vannapurve@xxxxxxxxxx> wrote:
>
> The goal of this series is to align memory conversion requests from CVMs to
> huge page sizes, to allow better host side management of guest memory and
> optimized page table walks.
>
> This patch series is partially tested and needs more work. I am seeking
> feedback from the wider community before making further progress.
>
> Background
> =====================
> Confidential VMs (CVMs) support two types of guest memory ranges:
> 1) Private Memory: Intended to be consumed/modified only by the CVM.
> 2) Shared Memory: visible to both guest/host components, used for
> non-trusted IO.
>
> Guest memfd [1] support is set to be merged upstream to handle guest private
> memory isolation from host userspace. The guest memfd approach allows the
> following setup:
> * private memory backed using the guest memfd file which is not accessible
> from host userspace.
> * Shared memory backed by tmpfs/hugetlbfs files that are accessible from
> host userspace.
>
> Userspace VMM needs to register two backing stores for all of the guest
> memory ranges:
> * HVA for shared memory
> * Guest memfd ranges for private memory
>
> KVM keeps track of shared/private guest memory ranges that can be updated at
> runtime using IOCTLs. This allows KVM to back the guest memory using either HVA
> (shared) or guest memfd file offsets (private) based on the attributes of the
> guest memory ranges.
>
> In this setup, there is a possibility of "double allocation", i.e. scenarios
> where both the shared and private memory backing stores mapped to the same
> guest memory ranges have memory allocated.
>
> The guest issues a hypercall to convert the memory types, which is forwarded
> by KVM to the host userspace.
> Userspace VMM is supposed to handle conversion as follows:
> 1) Private to shared conversion:
> * Update guest memory attributes for the range to be shared using KVM
> supported IOCTLs.
> - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
> to the guest memory being converted.
> * Unback the guest memfd range.
> 2) Shared to private conversion:
> * Update guest memory attributes for the range to be private using KVM
> supported IOCTLs.
> - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
> to the guest memory being converted.
> * Unback the shared memory file.
>
> Note that unbacking needs to be done for both kinds of conversions in order to
> avoid double allocation.
>
> Problem
> =====================
> CVMs can convert memory between these two types at 4K granularity. Conversion
> done at 4K granularity causes issues when using guest memfd support
> with hugetlb/Hugepage backed guest private memory:
> 1) Hugetlb fs doesn't allow freeing subpage ranges when punching holes,
> causing all the private to shared memory conversions to result in double
> allocation.
> 2) Even if a new fs is implemented for guest memfd that allows splitting
> hugepages, punching holes at 4K will cause:
> - loss of vmemmap optimization [2]
> - more memory for EPT/NPT entries and extra pagetable walks for guest
> side accesses.
> - shared memory mappings consuming more host pagetable entries and
> extra pagetable walks for host side access.
> - Higher number of conversions with additional overhead of VM exits
> serviced by host userspace.
>
> Memory conversion scenarios in the guest that are of major concern:
> - SWIOTLB area conversion early during boot.
> * dma_map_* API invocations for CVMs result in using bounce buffers
> from SWIOTLB region which is already marked as shared.
> - Device drivers allocating memory using dma_alloc_* APIs at runtime
> that bypass SWIOTLB.
>
> Proposal
> =====================
> To counter the above issues, this series proposes the following:
> 1) Use boot time allocated SWIOTLB pools for all DMA memory allocated
> using dma_alloc_* APIs.
> 2) Increase memory allocated at boot for SWIOTLB from 6% to 8% for CVMs.
> 3) Enable dynamic SWIOTLB [4] by default for CVMs so that SWIOTLB can be
> scaled up as needed.
> 4) Ensure SWIOTLB pool is 2MB aligned so that all the conversions happen at
> 2M granularity once during boot.
> 5) Add a check to ensure all conversions happen at 2M granularity.
>
> ** This series leaves out some of the conversion sites which might not
> be 2M aligned but should be easy to fix once the approach is finalized. **
>
> 1G alignment for conversion:
> * Using 1G alignment may cause over-allocated SWIOTLB buffers but might
> be acceptable for CVMs depending on more considerations.
> * It might be challenging to use 1G aligned conversion in OVMF. 2M
> alignment should be achievable with OVMF changes [3].
>
> Alternatives could be:
> 1) Separate hugepage aligned DMA pools set up by individual device drivers in
> case of CVMs.
>
> [1] https://lore.kernel.org/linux-mips/20231105163040.14904-1-pbonzini@xxxxxxxxxx/
> [2] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html
> [3] https://github.com/tianocore/edk2/pull/3784
> [4] https://lore.kernel.org/lkml/20230908080031.GA7848@xxxxxx/T/
>
> Vishal Annapurve (5):
> swiotlb: Support allocating DMA memory from SWIOTLB
> swiotlb: Allow setting up default alignment of SWIOTLB region
> x86: CVMs: Enable dynamic swiotlb by default for CVMs
> x86: CVMs: Allow allocating all DMA memory from SWIOTLB
> x86: CVMs: Ensure that memory conversions happen at 2M alignment
>
> arch/x86/Kconfig | 2 ++
> arch/x86/kernel/pci-dma.c | 2 +-
> arch/x86/mm/mem_encrypt.c | 8 ++++++--
> arch/x86/mm/pat/set_memory.c | 6 ++++--
> include/linux/swiotlb.h | 22 ++++++----------------
> kernel/dma/direct.c | 4 ++--
> kernel/dma/swiotlb.c | 17 ++++++++++++-----
> 7 files changed, 33 insertions(+), 28 deletions(-)
>
> --
> 2.43.0.275.g3460e3d667-goog
>

Ping for review of this series.
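
For reviewers skimming the thread, the 2M granularity check from item 5 of
the proposal can be sketched roughly as below. This is only an illustrative
standalone sketch, not the code in the patches: the names CVM_CONV_SHIFT,
cvm_conv_is_2m_aligned() and cvm_conv_expand_to_2m() are hypothetical and
do not appear in the series. It assumes a conversion request is described
by a start address and a size in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not taken from the series. */
#define CVM_CONV_SHIFT	21				/* 2M = 1 << 21 */
#define CVM_CONV_SIZE	(UINT64_C(1) << CVM_CONV_SHIFT)
#define CVM_CONV_MASK	(CVM_CONV_SIZE - 1)

/*
 * A conversion request is acceptable only if it is non-empty and both
 * its start address and its size are multiples of 2M.
 */
static bool cvm_conv_is_2m_aligned(uint64_t start, uint64_t size)
{
	return size != 0 && ((start | size) & CVM_CONV_MASK) == 0;
}

/*
 * Widen an arbitrary range to the smallest enclosing 2M aligned range:
 * round the start down and the end up to a 2M boundary.
 */
static void cvm_conv_expand_to_2m(uint64_t start, uint64_t size,
				  uint64_t *out_start, uint64_t *out_size)
{
	uint64_t end;

	assert(out_start && out_size);

	end = (start + size + CVM_CONV_MASK) & ~CVM_CONV_MASK;
	*out_start = start & ~CVM_CONV_MASK;
	*out_size = end - *out_start;
}
```

With these helpers, a single 4K request at 0x201000 fails the check, and
widening it yields the 2M range starting at 0x200000, which is the kind of
request the series wants the guest to issue in the first place.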

Thanks,
Vishal