[RFC V1 0/5] x86: CVMs: Align memory conversions to 2M granularity

From: Vishal Annapurve
Date: Fri Jan 12 2024 - 00:53:05 EST


The goal of this series is to align memory conversion requests from CVMs
to huge page sizes, allowing better host-side management of guest memory
and optimized page table walks.

This patch series is partially tested and needs more work; I am seeking
feedback from the wider community before making further progress.

Background
=====================
Confidential VMs (CVMs) support two types of guest memory ranges:
1) Private memory: intended to be consumed/modified only by the CVM.
2) Shared memory: visible to both guest and host components, used for
non-trusted IO.

Guest memfd [1] support is set to be merged upstream to isolate guest
private memory from host userspace. The guest memfd approach allows the
following setup:
* Private memory backed by the guest memfd file, which is not accessible
from host userspace.
* Shared memory backed by tmpfs/hugetlbfs files that are accessible from
host userspace.

The userspace VMM needs to register two backing stores for all of the
guest memory ranges:
* HVA for shared memory
* Guest memfd ranges for private memory

KVM keeps track of shared/private guest memory ranges, which can be
updated at runtime using IOCTLs. This allows KVM to back guest memory
using either the HVA (shared) or guest memfd file offsets (private),
based on the attributes of the guest memory ranges.

In this setup, there is a possibility of "double allocation", i.e.
scenarios where both the shared and private backing stores mapped to the
same guest memory range have memory allocated.

The guest issues a hypercall to convert memory between the two types,
which KVM forwards to host userspace.
The userspace VMM is expected to handle the conversion as follows:
1) Private to shared conversion:
* Update guest memory attributes for the range to be shared using KVM
supported IOCTLs.
- While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
to the guest memory being converted.
* Unback the guest memfd range.
2) Shared to private conversion:
* Update guest memory attributes for the range to be private using KVM
supported IOCTLs.
- While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
to the guest memory being converted.
* Unback the shared memory file.

Note that unbacking needs to be done for both kinds of conversions in order to
avoid double allocation.
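The bookkeeping above can be sketched as a toy userspace model (the
struct and helpers below are hypothetical, not kernel code): a
conversion is an attribute flip plus an unback of the now-unused backing
store, and skipping the unback step is exactly the double-allocation
case.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-page state as described above: KVM tracks a shared/private
 * attribute, and each type has its own backing store (guest memfd for
 * private, tmpfs/hugetlbfs for shared). Purely illustrative. */
struct gpage {
	bool is_private;   /* KVM memory attribute */
	bool memfd_backed; /* private backing store populated */
	bool shared_backed;/* shared backing store populated */
};

/* Private -> shared: flip the attribute (KVM_SET_MEMORY_ATTRIBUTES,
 * which also unmaps EPT/NPT entries), then unback the guest memfd
 * range (fallocate(PUNCH_HOLE) in the real flow). */
static void convert_to_shared(struct gpage *p)
{
	p->is_private = false;
	p->memfd_backed = false;
	p->shared_backed = true;
}

/* Shared -> private: the mirror image, punching a hole in the shared
 * file instead. */
static void convert_to_private(struct gpage *p)
{
	p->is_private = true;
	p->shared_backed = false;
	p->memfd_backed = true;
}

/* Double allocation = both backing stores populated for one page. */
static bool double_allocated(const struct gpage *p)
{
	return p->memfd_backed && p->shared_backed;
}
```

If either conversion helper dropped its unback step, double_allocated()
would return true for the converted range, which is the failure mode
the note above warns about.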

Problem
=====================
CVMs can convert memory between these two types at 4K granularity.
Conversion at 4K granularity causes issues when guest memfd support is
used with hugetlb/hugepage-backed guest private memory:
1) Hugetlbfs doesn't allow freeing subpage ranges when punching holes,
causing all private-to-shared memory conversions to result in double
allocation.
2) Even if a new fs is implemented for guest memfd that allows splitting
hugepages, punching holes at 4K will cause:
- loss of the vmemmap optimization [2]
- more memory for EPT/NPT entries and extra page table walks for guest
side accesses.
- shared memory mappings consuming more host page table entries and
extra page table walks for host side accesses.
- a higher number of conversions, with the additional overhead of VM
exits serviced by host userspace.
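To put rough numbers on the page-table overhead (back-of-the-envelope
arithmetic, not measured data): mapping the same region with 4K leaf
entries needs 512 times as many last-level entries as mapping it with
2M entries.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (4ULL << 10)
#define SZ_2M (2ULL << 20)
#define SZ_1G (1ULL << 30)

/* Number of last-level page table entries needed to map @size bytes
 * with leaf pages of @leaf bytes each. Illustrative helper only. */
static uint64_t leaf_entries(uint64_t size, uint64_t leaf)
{
	return size / leaf;
}
```

For a 1G range this is 262144 4K entries versus 512 2M entries, and
every extra page-table level traversed on a TLB miss is paid on both
the guest (EPT/NPT) and host sides.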

Memory conversion scenarios in the guest that are of major concern:
- SWIOTLB area conversion early during boot.
* dma_map_* API invocations for CVMs result in using bounce buffers
from SWIOTLB region which is already marked as shared.
- Device drivers allocating memory using dma_alloc_* APIs at runtime
that bypass SWIOTLB.

Proposal
=====================
To counter the above issues, this series proposes the following:
1) Use boot time allocated SWIOTLB pools for all DMA memory allocated
using dma_alloc_* APIs.
2) Increase memory allocated at boot for SWIOTLB from 6% to 8% for CVMs.
3) Enable dynamic SWIOTLB [4] by default for CVMs so that SWIOTLB can be
scaled up as needed.
4) Ensure SWIOTLB pool is 2MB aligned so that all the conversions happen at
2M granularity once during boot.
5) Add a check to ensure all conversions happen at 2M granularity.
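The check in item 5 amounts to a simple predicate; a minimal sketch
(hypothetical helper names, mirroring the shape of the IS_ALIGNED
checks used in the kernel) could look like:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_SIZE (2ULL << 20) /* 2M huge page size on x86-64 */

/* Kernel-style power-of-two alignment test. */
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* Reject conversion requests whose start address or size is not
 * 2M-aligned. Sketch of the check this series adds, not the actual
 * set_memory.c code. */
static bool conversion_is_2m_aligned(uint64_t addr, uint64_t size)
{
	return IS_ALIGNED(addr, PMD_SIZE) && IS_ALIGNED(size, PMD_SIZE);
}
```

With items 1-4 in place, every conversion request should already
satisfy this predicate, so the check acts as a guardrail that flags any
remaining 4K-granularity conversion site.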

** This series leaves out some of the conversion sites which might not
be 2M aligned but should be easy to fix once the approach is finalized. **

1G alignment for conversion:
* Using 1G alignment may cause over-allocation of SWIOTLB buffers,
which might be acceptable for CVMs depending on further considerations.
* It might be challenging to use 1G aligned conversion in OVMF. 2M
alignment should be achievable with OVMF changes [3].
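The over-allocation concern is easy to quantify (illustrative
arithmetic with a hypothetical 4GiB guest, not figures from this
series): a 6% SWIOTLB pool is roughly 246MiB, so rounding it up to a
2M boundary costs almost nothing, while rounding it up to a 1G
boundary over-allocates by roughly 4x.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_2M (2ULL << 20)
#define SZ_1G (1ULL << 30)

/* Round @size up to a multiple of @align; @align must be a power of
 * two. Same shape as the kernel's ALIGN() macro. */
static uint64_t round_up_pow2(uint64_t size, uint64_t align)
{
	return (size + align - 1) & ~(align - 1);
}
```

6% of a 4GiB guest is 257698037 bytes; aligned to 2M that becomes
246MiB (257949696 bytes), while aligned to 1G it balloons to a full
1GiB.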

Alternatives could be:
1) Separate hugepage-aligned DMA pools set up by individual device
drivers in the case of CVMs.

[1] https://lore.kernel.org/linux-mips/20231105163040.14904-1-pbonzini@xxxxxxxxxx/
[2] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html
[3] https://github.com/tianocore/edk2/pull/3784
[4] https://lore.kernel.org/lkml/20230908080031.GA7848@xxxxxx/T/

Vishal Annapurve (5):
swiotlb: Support allocating DMA memory from SWIOTLB
swiotlb: Allow setting up default alignment of SWIOTLB region
x86: CVMs: Enable dynamic swiotlb by default for CVMs
x86: CVMs: Allow allocating all DMA memory from SWIOTLB
x86: CVMs: Ensure that memory conversions happen at 2M alignment

arch/x86/Kconfig | 2 ++
arch/x86/kernel/pci-dma.c | 2 +-
arch/x86/mm/mem_encrypt.c | 8 ++++++--
arch/x86/mm/pat/set_memory.c | 6 ++++--
include/linux/swiotlb.h | 22 ++++++----------------
kernel/dma/direct.c | 4 ++--
kernel/dma/swiotlb.c | 17 ++++++++++++-----
7 files changed, 33 insertions(+), 28 deletions(-)

--
2.43.0.275.g3460e3d667-goog