[PATCH RFC v9 00/51] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

From: Michael Roth
Date: Mon Jun 12 2023 - 00:28:35 EST


This patchset is also available at:

https://github.com/amdese/linux/commits/snp-host-v9-rfc

and is based on top of the following tree:

https://github.com/mdroth/linux/commits/kvm_gmem_solo_fixes

which in turn is based on Sean Christopherson's UPM base support tree,
with a couple fixes/workarounds needed for SEV/SNP support. [1]

== OVERVIEW ==

This patchset implements SEV-SNP hypervisor support for Linux.

This version is being posted as an RFC due to fairly extensive changes
relating to transitioning the SEV-SNP implementation to using
guest_memfd (gmem, aka Unmapped Private Memory) to manage private guest
pages instead of the legacy SEV memory registration ioctls.

For this purpose we've added a number of hooks on top of gmem to plumb
in the necessary RMP table updates when mapping private memory into a
guest's nested page table, and to restore it to shared/hypervisor-owned
state when freeing gmem-allocated memory back to the host. Our hope is
that some of these hooks can be re-used for other platforms as well, but
we have tried to make them as minimal as possible in case they prove to
be SNP-specific. For quicker review of this aspect, they are at the
beginning of the series, directly on top of the gmem patchset.

Outside of UPM-related items, we've also included fairly extensive changes
based on review feedback from v8 and would appreciate any feedback on
those aspects as well.


== LAYOUT ==

PATCH 01-05: Pre-patches that add generic gmem and KVM MMU hooks to handle
plumbing gmem memory into CoCo guests, and make arch/x86/coco
re-usable for common SEV host code instead of only guest
code.
PATCH 06-22: Host SNP initialization code and CCP driver prep for handling
SNP cmds
PATCH 13-22: general SNP detection/enablement for host and CCP driver
PATCH 23-46: core KVM support for running SEV-SNP guests
PATCH 47-51: misc handling for IOMMU support, guest request handling, and
debug infrastructure


== TESTING (note updated QEMU command-lines) ==

For testing this via QEMU, use the following tree:

https://github.com/amdese/qemu/commits/snp-wip-gmem

SEV-SNP with gmem/UPM enabled:

# set discard=none to disable discarding memory post-conversion, faster
# boot times, but increased memory usage
qemu-system-x86_64 -cpu EPYC-Milan-v2 \
-object memory-backend-memfd-private,id=ram1,size=1G,share=true \
-object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,discard=both \
-machine q35,confidential-guest-support=sev0,memory-backend=ram1,kvm-type=protected \
...

KVM selftests for UPM:

cd $kernel_src_dir
make -C tools/testing/selftests TARGETS="kvm" EXTRA_CFLAGS="-DDEBUG -I<path to kernel headers>"
sudo tools/testing/selftests/kvm/x86_64/private_mem_conversions_test


== BACKGROUND (SEV-SNP) ==

This part of the Secure Nested Paging (SEV-SNP) series focuses on the
changes required in a host OS for SEV-SNP support. The series builds upon
the SEV-SNP guest support that is now part of mainline.

This series provides the basic building blocks to support booting SEV-SNP
VMs; it does not cover all of the security enhancements introduced by
SEV-SNP, such as interrupt protection.

The CCP driver is enhanced to provide new APIs that use the SEV-SNP
specific commands defined in the SEV-SNP firmware specification. The KVM
driver uses those APIs to create and manage the SEV-SNP guests.

The GHCB specification version 2 introduces a new set of NAE events that
are used by the SEV-SNP guest to communicate with the hypervisor. The series
provides support to handle the following new NAE events:

- Register GHCB GPA
- Page State Change Request
- Hypervisor feature
- Guest message request

When pages are marked as guest-owned in the RMP table, they are assigned
to a specific guest/ASID, as well as a specific GFN within the guest. Any
attempts to map it in the RMP table to a different guest/ASID, or a
different GFN within a guest/ASID, will result in an RMP nested page fault.
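
As a rough illustration of the ownership check described above, here is
a toy userspace model. It is only a sketch: the struct layout, the table
size and every toy_* name are made up for this example and do not
reflect the real (hardware-defined) RMP entry format or any code in this
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: NOT the real RMP entry layout. */
struct toy_rmp_entry {
	bool     assigned;	/* false = shared/hypervisor-owned */
	uint32_t asid;		/* owning guest */
	uint64_t gfn;		/* guest frame the page is bound to */
};

#define TOY_RMP_ENTRIES 16

/* One entry per host page frame, all shared by default. */
static struct toy_rmp_entry toy_rmp[TOY_RMP_ENTRIES];

/* Bind a host page (by PFN) to one guest/ASID at one GFN. */
static void toy_rmp_assign(uint64_t pfn, uint32_t asid, uint64_t gfn)
{
	assert(pfn < TOY_RMP_ENTRIES);
	toy_rmp[pfn] = (struct toy_rmp_entry){
		.assigned = true, .asid = asid, .gfn = gfn,
	};
}

/*
 * Check a private guest access that reaches 'pfn' through a nested
 * mapping for (asid, gfn). Returning false stands in for the RMP
 * nested page fault.
 */
static bool toy_rmp_check(uint64_t pfn, uint32_t asid, uint64_t gfn)
{
	assert(pfn < TOY_RMP_ENTRIES);
	const struct toy_rmp_entry *e = &toy_rmp[pfn];

	return e->assigned && e->asid == asid && e->gfn == gfn;
}
```

In this model, after toy_rmp_assign(3, 1, 0x100) only the combination
(pfn 3, ASID 1, GFN 0x100) passes toy_rmp_check(); a different ASID or
a different GFN for the same page fails the check.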

Prior to accessing a guest-owned page, the guest must validate it with a
special PVALIDATE instruction which will set a special bit in the RMP table
for the guest. This is the only way to set the validated bit outside of the
initial pre-encrypted guest payload/image; any attempts outside the guest to
modify the RMP entry from that point forward will result in the validated
bit being cleared, at which point the guest will trigger an exception if it
attempts to access that page so it can be made aware of possible tampering.
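
The validated-bit behaviour can be sketched the same way. This is a toy
state machine under simplifying assumptions, not the hardware
definition: toy_pvalidate() stands in for the guest executing PVALIDATE,
and toy_host_update() stands in for any change to the entry made from
outside the guest. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two RMP bits discussed above. */
struct toy_rmp_state {
	bool assigned;	/* page is guest-owned */
	bool validated;	/* guest has accepted the page */
};

/* Guest side: models PVALIDATE on a guest-owned page. */
static bool toy_pvalidate(struct toy_rmp_state *e)
{
	assert(e);
	if (!e->assigned)
		return false;	/* nothing to validate on a shared page */
	e->validated = true;
	return true;
}

/* Host side: any update from outside the guest clears the bit. */
static void toy_host_update(struct toy_rmp_state *e, bool assigned)
{
	assert(e);
	e->assigned = assigned;
	e->validated = false;
}

/*
 * Private access by the guest. Returning false stands in for the
 * exception that makes the guest aware of possible tampering.
 */
static bool toy_guest_access_ok(const struct toy_rmp_state *e)
{
	assert(e);
	return e->assigned && e->validated;
}
```

The sequence of interest is: host assigns the page, guest validates it
and can access it, host touches the entry again, and the next guest
access fails until the guest re-validates.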

One exception to this is the initial guest payload, which is pre-validated
by the firmware prior to launching. The guest can use Guest Message requests
to fetch an attestation report which will include the measurement of the
initial image so that the guest can verify it was booted with the expected
image/environment.

After boot, guests can use Page State Change requests to switch pages
between shared/hypervisor-owned and private/guest-owned to share data for
things like DMA, virtio buffers, and other GHCB requests.

In this implementation of SEV-SNP, private guest memory is managed by a new
kernel framework called guest_memfd (gmem). With gmem, a new
KVM_SET_MEMORY_ATTRIBUTES KVM ioctl has been added to tell the KVM
MMU whether a particular GFN should be backed by shared (normal) memory or
private (gmem-allocated) memory. To tie into this, Page State Change
requests are forwarded to userspace via KVM_EXIT_VMGEXIT exits, which will
then issue the corresponding KVM_SET_MEMORY_ATTRIBUTES call to set the
private/shared state in the KVM MMU.
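
To make the flow above concrete, here is a minimal userspace model of
what a VMM might do when it receives a Page State Change request. It
deliberately does not use the real KVM uAPI structures:
toy_set_memory_attributes() is a hypothetical stand-in for the
KVM_SET_MEMORY_ATTRIBUTES call, and the per-GFN array is a stand-in for
the attribute state KVM tracks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_GFNS 64

enum toy_psc_op { TOY_PSC_PRIVATE, TOY_PSC_SHARED };

/* Stand-in for KVM's per-GFN attribute state; true = private. */
static bool toy_gfn_private[TOY_NR_GFNS];

/*
 * Hypothetical stand-in for issuing KVM_SET_MEMORY_ATTRIBUTES on a
 * GFN range. Returns 0 on success, -1 if the range is out of bounds.
 */
static int toy_set_memory_attributes(uint64_t gfn, uint64_t npages,
				     bool make_private)
{
	if (gfn >= TOY_NR_GFNS || npages > TOY_NR_GFNS - gfn)
		return -1;

	for (uint64_t i = 0; i < npages; i++)
		toy_gfn_private[gfn + i] = make_private;

	return 0;
}

/* What the VMM does on the exit carrying a Page State Change request. */
static int toy_handle_psc(enum toy_psc_op op, uint64_t gfn, uint64_t npages)
{
	return toy_set_memory_attributes(gfn, npages, op == TOY_PSC_PRIVATE);
}
```

For example, toy_handle_psc(TOY_PSC_PRIVATE, 8, 4) marks GFNs 8-11
private, and a later toy_handle_psc(TOY_PSC_SHARED, 10, 2) flips GFNs
10-11 back to shared while leaving 8-9 private.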

The gmem / KVM MMU hooks added in this series will then update the RMP table
entries for the backing PFNs to set them to guest-owned/private when mapping
private pages into the guest via KVM MMU, or use the normal KVM MMU handling
in the case of shared pages where the corresponding RMP table entries are
left in the default shared/hypervisor-owned state.
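
The map-time and free-time behaviour described above can be modelled
roughly as follows. This is a sketch, not the hooks added by the series:
toy_map_into_guest() and toy_free_to_host() are hypothetical names, and
the table only tracks ownership.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_PFNS 32

enum toy_rmp_owner { TOY_RMP_SHARED = 0, TOY_RMP_GUEST };

/* One ownership state per backing PFN; default is shared. */
static enum toy_rmp_owner toy_rmp_table[TOY_NR_PFNS];

/*
 * Hypothetical map-time hook: called when the MMU maps 'pfn' into the
 * guest. Private mappings flip the entry to guest-owned; shared
 * mappings leave it in the default shared/hypervisor-owned state.
 */
static void toy_map_into_guest(uint64_t pfn, bool gfn_is_private)
{
	assert(pfn < TOY_NR_PFNS);
	if (gfn_is_private)
		toy_rmp_table[pfn] = TOY_RMP_GUEST;
}

/*
 * Hypothetical free-time hook: called before a gmem-allocated page is
 * handed back to the host, so the host can safely use it again.
 */
static void toy_free_to_host(uint64_t pfn)
{
	assert(pfn < TOY_NR_PFNS);
	toy_rmp_table[pfn] = TOY_RMP_SHARED;
}
```

The key property is symmetry: every page that becomes guest-owned at
map time is returned to the shared state before the host reuses it.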

== TODO / KNOWN ISSUES ==

* Add a per-arch CONFIG option for enabling platform-specific handling
when invalidating gmem pages and freeing them back to the host, as opposed
to the current approach, which defaults to issuing invalidations to a
weak-referenced stub implementation for non-x86 builds. Hoping for more
feedback on the general implementation first.
* This should incorporate all review feedback from v8, but if anything
slipped through the cracks please let me know.
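
For readers unfamiliar with the weak-stub pattern mentioned in the first
TODO item, here is a minimal standalone sketch. The function names are
hypothetical and not taken from the series; __attribute__((weak)) is the
GCC/Clang attribute that the kernel's __weak macro expands to.

```c
#include <assert.h>

/*
 * Generic default: a weak no-op. An architecture that needs real work
 * at invalidation time would provide a strong definition with the same
 * name in its own object file, and the linker would pick that one.
 * The name is made up for this example.
 */
__attribute__((weak))
int toy_arch_gmem_invalidate(unsigned long start_pfn, unsigned long end_pfn)
{
	(void)start_pfn;
	(void)end_pfn;
	return 0;	/* nothing to do on this platform */
}

/* Generic caller: always calls the hook, whoever ends up defining it. */
static int toy_gmem_free_range(unsigned long start_pfn, unsigned long end_pfn)
{
	assert(start_pfn <= end_pfn);
	return toy_arch_gmem_invalidate(start_pfn, end_pfn);
}
```

With only this file linked in, toy_gmem_free_range() reaches the weak
stub and returns 0. The CONFIG-based alternative mentioned in the TODO
would instead compile the call out (or to a static inline stub) on
architectures that do not select the option.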

[1] https://lore.kernel.org/lkml/20230512002124.3sap3kzxpegwj3n2@xxxxxxx/

Changes since v8:

* Rework gmem/UPM hooks based on Sean's latest gmem/UPM tree
* Move SEV lazy-pinning support out to a separate series which uses this
series as a prereq instead of the other way around.
* Re-organize extended guest request patches into 3 patches encompassing
SEV FD ioctls for host-wide certs, KVM ioctls for per-instance certs,
and the guest request handling that consumes them. Also move them to
the top of the series to better separate them from the core SNP patches
(Alexey, Zhi, Ashish, Dov, Dionna, others)
* Various other changes/fixups for extended guest request handling (Dov,
Alexey, Dionna)
* Use helper to calculate max RMP entry size and improve readability (Dave)
* Use architecture-independent GPA value for initial VMSA pages
* Ensure SEV_CMD_SNP_GUEST_REQUEST failures are indicated to guest (Alex)
* Allocate per-instance certs on-demand (Alex)
* comment fixup for RMP fault handling (Zhi)
* commit msg rewording for MSR-based PSCs (Zhi)
* update SNP command/struct definitions based on 1.54 ABI (Saban)
* use sev_deactivate_lock around SEV_CMD_SNP_DECOMMISSION (Saban)
* Various comment/commit fixups (Zhi, Alex, Kim, Vlastimil, Dave,
* kexec fixes for newer SNP firmwares (Ashish)
* Various other fixups and re-ordering of patches.

Changes since v7:

* Rebase to Sean's updated UPM base support tree
* Drop KVM_CAP_UNMAPPED_MEMORY and .private_mem_enabled x86 op in favor
of kvm_arch_has_private_mem() and vm_type KVM_VM_CREATE arg
* Drop GHCB map/unmap refactoring and post map/unmap hooks as they are no
longer needed with UPM
* Move .fault_is_private implementation to SNP patch range, no longer
needed for SEV.
* Don't call attribute update / invalidation hooks under kvm->mmu_lock
(Tom, Jarkko)
* Revert switch to using set_memory_p()/set_memory_np() in rmpupdate() due
to it causing performance regression
* Commit fixups for 'fault_is_private'/'update_mem_attr' hooks, have
'fault_is_private' return bool (Boris)
* Split kvm_vm_set_region_attr() into separate patch. (Jarkko)
* Copy corrected CPUID page to userspace when firmware rejects it (Tom,
Jarkko)
* Fix sev_dump_rmpentry() error-handling (Alper)
* Use adjusted cmd_buf pointer rather than sev->cmd_buf directly (Alper)
* Correct typo in SNP_GET_EXT_CONFIG documentation (Dov)
* Update struct kvm_sev_snp_launch_finish definition in
amd-memory-encryption.rst (Tom)
* Fix snp_launch_update_vmsa replacing created_vcpus with online_vcpus
* Fix SNP_DBG_DECRYPT to not include len parameter.
* Fix SNP_LAUNCH_FINISH to copy host-data from userspace


Changes since v6:

* Added support for restrictedmem/UPM, and removed SEV-specific
implementation of private memory management. As a result of this rework
the following patches were no longer needed so were dropped:
- KVM: SVM: Mark the private vma unmergable for SEV-SNP guests
- KVM: SVM: Disallow registering memory range from HugeTLB for SNP guest
- KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX and SNP
- KVM: x86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use
* Moved RMP table entry structure definition (struct rmpentry)
to sev.c, to not expose this non-architectural definition to rest
of the kernel and making the structure private to SNP code.
Also made RMP table entry accessors to be inline functions and
removed all accessors which are not called more than once.
Added a new function rmptable_entry() to index into the RMP table
and return RMP table entry.
* Moved RMPUPDATE, PSMASH helper function declarations to x86 arch
specific include namespace from linux namespace. Added comments
for these helper functions.
* Introduce set_memory_p() to provide a way to change attributes of a
memory range to be marked as present and added to the kernel
directmap, and invalidating/restoring pages from directmap are
now done using set_memory_np() and set_memory_p().
* Added detailed comments around user RMP #PF fault handling and
simplified computation of the faulting pfn for large-pages.
* Added support to return pfn from dump_pagetable() to do SEV-specific
fault handling, this is added as a pre-patch. This support is now
used to dump RMP entry in case of RMP #PF in show_fault_oops().
* Added a new generic SNP command params structure sev_data_snp_addr,
which is used for all SNP firmware API commands requiring a
single physical address parameter.
* Added support for new SNP_INIT_EX command with support for HV-Fixed
page range list.
* Added support for new SNP_SHUTDOWN_EX command which allows
disabling enforcement of SNP in the IOMMU. Also DF_FLUSH is done
at SNP shutdown if it indicates DF_FLUSH is required.
* Make sev_do_cmd() a generic API interface for the hypervisor
to issue commands to manage an SEV and SNP guest. Also removed
the API wrappers used by the hypervisor to manage an SEV-SNP guest.
All these APIs now invoke sev_do_cmd() directly.
* Introduce snp leaked pages list. Pages that are unsafe to release
back to the page-allocator, because they can't be reclaimed or
transitioned back to hypervisor/shared state, are now added
to this internal leaked pages list to prevent fatal page faults
when accessing these pages. The function snp_leak_pages() is
renamed to snp_mark_pages_offline() and is an external function
available to both CCP driver and the SNP hypervisor code. Removed
call to memory_failure() when leaking/marking pages offline.
* Remove snp_set_rmp_state() multiplexor code and add new separate
helpers such as rmp_mark_pages_firmware() & rmp_mark_pages_shared().
The callers now issue snp_reclaim_pages() directly when needed as
done by __snp_free_firmware_pages() and unmap_firmware_writeable().
All callers of snp_set_rmp_state() modified to call helpers
rmp_mark_pages_firmware() or rmp_mark_pages_shared() as required.
* Change snp_reclaim_pages() to take physical address as an argument
and clear C-bit from this physical address argument internally.
* Output parameter sev_user_data_ext_snp_config in sev_ioctl_snp_get_config()
is memset to zero to avoid kernel memory leaking.
* Prevent race between sev_ioctl_snp_set_config() and
snp_guest_ext_guest_request() for sev->snp_certs_data by acquiring
sev->snp_certs_lock mutex.
* Zeroed out struct sev_user_data_snp_config in
sev_ioctl_snp_set_config() to prevent leaking uninitialized
kernel memory.
* Optimized snp_safe_alloc_page() by avoiding multiple calls to
pfn_to_page() and checking for a hugepage using pfn instead of
expanding to full physical address.
* Invoke host_rmp_make_shared() with leak parameter set to true
if VMSA page cannot be transitioned back to shared state.
* Fix snp_launch_finish() to always send the ID_AUTH struct to
the firmware. Use params.auth_key_en indicator to set
if the ID_AUTH struct contains an author key or not.
* Cleanup snp_context_create() and allocate certs_data in this
function using kzalloc() to prevent giving the guest
uninitialized kernel memory.
* Remove the check for guest supplied buffer greater than the data
provided by the hypervisor in snp_handle_ext_guest_request().
* Add check in sev_snp_ap_create() to guard against a malicious guest
using RMPADJUST to turn a large page into a VMSA, which would hit the
SNP erratum where the CPU will incorrectly signal an RMP violation #PF
if a hugepage collides with the RMP entry of the VMSA page. Reject the
AP CREATE request if the VMSA address from the guest is 2M aligned.
* Make VMSAVE target area memory allocation SNP safe, implemented
workaround for an SNP erratum where the CPU will incorrectly signal
an RMP violation #PF if a hugepage (2MB or 1GB) collides with the
RMP entry of the VMSAVE target page.
* Fix handle_split_page_fault() to work with memfd backed pages.
* Add KVM commands for per-VM instance certificates.
* Add IOMMU_SNP_SHUTDOWN support, which enables host kexec
with SNP.
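
As an illustration of the AP-creation check noted in the list above
(rejecting a VMSA address that is 2M aligned), the alignment test itself
can be sketched as below. The helper name is hypothetical and this is
not the code from sev_snp_ap_create(); it only shows the arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SZ_2M (UINT64_C(2) << 20)	/* 2 MiB */

/*
 * Returns true if the guest-supplied VMSA address is acceptable, i.e.
 * it is NOT 2M aligned. A 2M-aligned address sits at the start of a
 * potential hugepage, which is the case the erratum workaround avoids.
 */
static bool toy_vmsa_gpa_is_acceptable(uint64_t vmsa_gpa)
{
	return (vmsa_gpa & (TOY_SZ_2M - 1)) != 0;
}
```

Under this check 0x200000 is rejected, while 0x201000 (the next 4K page)
is accepted.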

Documentation/virt/coco/sev-guest.rst | 54 +
Documentation/virt/kvm/api.rst | 34 +
.../virt/kvm/x86/amd-memory-encryption.rst | 147 ++
arch/x86/Kbuild | 2 +-
arch/x86/coco/Makefile | 3 +-
arch/x86/coco/sev/Makefile | 3 +
arch/x86/coco/sev/host.c | 524 ++++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/kvm-x86-ops.h | 3 +
arch/x86/include/asm/kvm_host.h | 23 +
arch/x86/include/asm/msr-index.h | 11 +-
arch/x86/include/asm/sev-common.h | 30 +
arch/x86/include/asm/sev-host.h | 37 +
arch/x86/include/asm/sev.h | 5 +-
arch/x86/include/asm/svm.h | 6 +
arch/x86/include/asm/trap_pf.h | 18 +-
arch/x86/kernel/cpu/amd.c | 24 +-
arch/x86/kernel/cpu/bugs.c | 7 +-
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/lapic.c | 5 +-
arch/x86/kvm/mmu.h | 2 -
arch/x86/kvm/mmu/mmu.c | 15 +-
arch/x86/kvm/mmu/mmu_internal.h | 39 +-
arch/x86/kvm/svm/nested.c | 2 +-
arch/x86/kvm/svm/sev.c | 1802 +++++++++++++++++---
arch/x86/kvm/svm/svm.c | 53 +-
arch/x86/kvm/svm/svm.h | 38 +-
arch/x86/kvm/x86.c | 17 +
arch/x86/mm/fault.c | 21 +
drivers/crypto/ccp/sev-dev.c | 1064 +++++++++++-
drivers/crypto/ccp/sev-dev.h | 16 +
drivers/iommu/amd/init.c | 57 +-
include/linux/amd-iommu.h | 3 +-
include/linux/kvm_host.h | 10 +
include/linux/psp-sev.h | 304 +++-
include/uapi/linux/kvm.h | 74 +
include/uapi/linux/psp-sev.h | 71 +
tools/arch/x86/include/asm/cpufeatures.h | 1 +
virt/kvm/guest_mem.c | 48 +-
virt/kvm/kvm_main.c | 75 +-
41 files changed, 4383 insertions(+), 275 deletions(-)