Re: [RFC 00/48] RISC-V CoVE support

From: Atish Patra
Date: Wed Apr 19 2023 - 19:00:06 EST


On Thu, Apr 20, 2023 at 3:47 AM Atish Patra <atishp@xxxxxxxxxxxx> wrote:
>
> This patch series adds the RISC-V Confidential VM Extension (CoVE) support to
> Linux kernel. The RISC-V CoVE specification introduces non-ISA, SBI APIs. These
> APIs enable a confidential environment in which a guest VM's data can be isolated
> from the host while the host retains control of guest VM management and platform
> resources(memory, CPU, I/O).
>
> This is a very early WIP work. We want to share this with the community to get any
> feedback on overall architecture and direction. Any other feedback is welcome too.
>
> The detailed CoVE architecture document can be found here [0]. It used to be
> called AP-TEE and renamed to CoVE recently to avoid overloading term of TEE in
> general. The specification is in the draft stages and is subjected to change based
> on the feedback from the community.
>
> The CoVE specification introduces 3 new SBI extensions.
> COVH - CoVE Host side interface
> COVG - CoVE Guest side interface
> COVI - CoVE Secure Interrupt management extension
>
> Some key acronyms introduced:
>
> TSM - TEE Security Manager
> TVM - TEE VM (aka Confidential VM)
>
> CoVE Architecture:
> ====================
> The CoVE APIs are designed to be implementation and architecture agnostic,
> allowing for different deployment models while retaining common host and guest
> kernel code. Two examples are shown in Figure 1 and Figure 2.
> As shown in both figures, the architecture introduces a new software component
> called the "TEE Security Manager" (TSM) that runs in HS mode. The TSM has minimal
> hw attested footprint on TCB as it is a passive component that doesn't support
> scheduling or timer interrupts. Both example deployment models provide memory
> isolation between the host and the TEE VM (TVM).
>
>
> Non secure world | Secure world |
> | |
> Non | |
> Virtualized | Virtualized | Virtualized Virtualized |
> Env | Env | Env Env |
> +----------+ | +----------+ | +----------+ +----------+ | --------------
> | | | | | | | | | | |
> | Host Apps| | | Apps | | | Apps | | Apps | | VU-Mode
> | (VMM) | | | | | | | | | |
> +----------+ | +----------+ | +----------+ +----------+ | --------------
> | | +----------+ | +----------+ +----------+ |
> | | | | | | | | | |
> | | | | | | TVM | | TVM | |
> | | | Guest | | | Guest | | Guest | | VS-Mode
> Syscalls | +----------+ | +----------+ +----------+ |
> | | | | |
> | SBI | SBI(COVG + COVI) |
> | | | | |
> +--------------------------+ | +---------------------------+ --------------
> | Host (Linux) | | | TSM (Salus) |
> +--------------------------+ | +---------------------------+
> | | | HS-Mode
> SBI (COVH + COVI) | SBI (COVH + COVI)
> | | |
> +-----------------------------------------------------------+ --------------
> | Firmware(OpenSBI) + TSM Driver | M-Mode
> +-----------------------------------------------------------+ --------------
> +-----------------------------------------------------------------------------
> | Hardware (RISC-V CPU + RoT + IOMMU)
> +----------------------------------------------------------------------------
> Figure 1: Host in HS model
>
>
> The deployment model shown in Figure 1 runs the host in HS mode where it is peer
> to the TSM which also runs in HS mode. It requires another component known as TSM
> Driver running in higher privilege mode than host/TSM. It is responsible for switching
> the context between the host and the TSM. TSM driver also manages the platform
> specific hardware solution via confidential domain bit as described in the specification[0]
> to provide the required memory isolation.
>
>
> Non secure world | Secure world
> |
> Virtualized Env | Virtualized Virtualized |
> Env Env |
> +-------------------------+ | +----------+ +----------+ | ------------
> | | | | | | | | | | |
> | Host Apps| | | Apps | | | Apps | | Apps | | VU-Mode
> +----------+ | +----------+ | +----------+ +----------+ | ------------
> | | | | |
> Syscalls SBI | | | |
> | | | | |
> +--------------------------+ | +-----------+ +-----------+ |
> | Host (Linux) | | | TVM Guest| | TVM Guest| | VS-Mode
> +--------------------------+ | +-----------+ +-----------+ |
> | | | | |
> SBI (COVH + COVI) | SBI SBI |
> | | (COVG + COVI) (COVG + COVI)|
> | | | | |
> +-----------------------------------------------------------+ --------------
> | TSM(Salus) | HS-Mode
> +-----------------------------------------------------------+ --------------
> |
> SBI
> |
> +---------------------------------------------------------+ --------------
> | Firmware(OpenSBI) | M-Mode
> +---------------------------------------------------------+ --------------
> +-----------------------------------------------------------------------------
> | Hardware (RISC-V CPU + RoT + IOMMU)
> +----------------------------------------------------------------------------
> Figure 2: Host in VS model
>
>
> The deployment model shown in Figure 2 simplifies the context switch and memory isolation
> by running the host in VS mode as a guest of TSM. Thus, the memory isolation is
> achieved by gstage mapping by the TSM. We don't need any additional hardware confidential
> domain bit to provide memory isolation. The downside of this model the host has to run the
> non-confidential VMs in nested environment which may have lower performance (yet to be measured).
> The current implementation Salus(TSM) doesn't support full nested virtualization yet.
>
> The platform must have a RoT to provide attestation in either model.
> This patch series implements the APIs defined by CoVE. The current TSM implementation
> allows the host to run TVMs as shown in figure 2. We are working on deployment
> model 1 in parallel. We do not expect any significant changes in either host/guest side
> ABI due to that.
>
> Shared memory between the host & TSM:
> =====================================
> To accelerate the H-mode CSR/GPR access, CoVE also reuses the Nested Acceleration (NACL)
> SBI extension[1]. NACL defines a per physical cpu shared memory area that is allocated
> at the boot. It allows the host running in VS mode to access H-mode CSR/GPR easily
> without trapping into the TSM. The CoVE specification clearly defines the exact
> state of the shared memory with r/w permissions at every call.
>
> Secure Interrupt management:
> ===========================
> The CoVE specification relies on the MSI based interrupt scheme defined in Advanced Interrupt
> Architecture specification[2]. The COVI SBI extension adds functions to bind
> a guest interrupt file to a TVMs. After that, only TCB components (TSM, TVM, TSM driver)
> can modify that. The host can inject an interrupt via TSM only.
> The TVMs are also in complete control of which interrupts it can receive. By default,
> all interrupts are denied. In this proof-of-concept implementation, all the interrupts
> are allowed by the guest at boot time to keep it simple.
>
> Device I/O:
> ===========
> In order to support paravirt I/O devices, SWIOTLB bounce buffer must be used by the
> guest. As the host can not access confidential memory, this buffer memory
> must be shared with the host via share/unshare functions defined in COVG SBI extension.
> RISC-V implementation achieves this generalizing mem_encrypt_init() similar to TDX/SEV/CCA.
> That's why, the CoVE Guest is only allowed to use virtio devices with VIRTIO_F_ACCESS_PLATFORM
> and VIRTIO_F_VERSION_1 as they force virtio drivers to use the DMA API.
>
> MMIO emulation:
> ======================
> TVM can register regions of address space as MMIO regions to be emulated by
> the host. TSM provides explicit SBI functions i.e. SBI_EXT_COVG_[ADD/REMOVE]_MMIO_REGION
> to request/remove MMIO regions. Any reads or writes to those MMIO region after
> SBI_EXT_COVG_ADD_MMIO_REGION call are forwarded to the host for emulation.
>
> This series allows any ioremapped memory to be emulated as MMIO region with
> above APIs via arch hookups inspired from pKVM work. We are aware that this model
> doesn't address all the threat vectors. We have also implemented the device
> filtering/authorization approach adopted by TDX[4]. However, those patches are not
> part of this series as the base TDX patches are still under active development.
> RISC-V CoVE will also adapt the revamped device filtering work once it is accepted
> by the Linux community in the future.
>
> The direct assignment of devices are a work in progress and will be added in the future[4].
>
> VMM support:
> ============
> This series is only tested with kvmtool support. Other VMM support (qemu-kvm, crossvm/rust-vmm)
> will be added later.
>
> Test cases:
> ===========
> We are working on kvm selftest for CoVE. We will post them as soon as they are ready.
> We haven't started any work on kvm unit-tests as RISC-V doesn't have basic infrastructure
> to support that. Once the kvm uni-test infrastructure is in place, we will add
> support for CoVE as well.
>
> Open design questions:
> ======================
>
> 1. The current implementation has two separate configs for guest(CONFIG_RISCV_COVE_GUEST)
> and the host (RISCV_COVE_HOST). The default defconfig will enable both so that
> same unified image works as both host & guest. Most likely distro prefer this way
> to minimize the maintenance burden but some may want a minimal CoVE guest image
> that has only hardened drivers. In addition to that, Android runs a microdroid instance
> in the confidential guests. A separate config will help in those case. Please let us
> know if there is any concern with two configs.
>
> 2. Lazy gstage page allocation vs upfront allocation with page pool.
> Currently, all gstage mappings happen at runtime during the fault. This is expensive
> as we need to convert that page to confidential memory as well. A page pool framework
> may be a better choice which can hold all the confidential pages which can be
> pre-allocated upfront. A generic page pool infrastructure may benefit other CC solutions ?
>
> 3. In order to allow both confidential VM and non-confidential VM, the series
> uses regular branching instead of static branches for CoVE VM specific cases through
> out KVM. That may cause a few more branch penalties while running regular VMs.
> The alternate option is to use function pointers for any function that needs to
> take a different path. As per my understanding, that would be worse than branches.
>
> Patch organization:
> ===================
> This series depends on quite a few RISC-V patches that are not upstream yet.
> Here are the dependencies.
>
> 1. RISC-V IPI improvement series
> 2. RISC-V AIA support series.
> 3. RISC-V NACL support series
>
> In this series, PATCH [0-5] are generic improvement and cleanup patches which
> can be merged independently.
>
> PATCH [6-26, 34-37] adds host side for CoVE.
> PATCH [27-33] adds the interrupt related changes.
> PATCH [34-49] Adds the guest side changes for CoVE.
>
> The TSM project is written in rust and can be found here:
> https://github.com/rivosinc/salus
>
> Running the stack
> ====================
>
> To run/test the stack, you would need the following components :
>
> 1) Qemu
> 2) Common Host & Guest Kernel
> 3) kvmtool
> 4) Host RootFS with KVMTOOL and Guest Kernel
> 5) Salus
>
> The detailed steps are available at[6]
>
> The Linux kernel patches are also available at [7] and the kvmtool patches
> are available at [8].
>
> TODOs
> =======
> As this is a very early work, the todo list is quite long :).
> Here are some of them (not in any specific order)
>
> 1. Support fd based private memory interface proposed in
> https://lkml.org/lkml/2022/1/18/395
> 2. Align with updated guest runtime device filtering approach.
> 3. IOMMU integration
> 4. Dedicated device assignment via TDSIP & SPDM[4]
> 5. Support huge pages
> 6. Page pool allocator to avoid convert/reclaim at every fault
> 7. Other VMM support (qemu-kvm, crossvm)
> 8. Complete the PoC for the deployment model 1 where host runs in HS mode
> 9. Attestation integration
> 10. Harden the interrupt allowed list
> 11. kvm self-tests support for CoVE
> 11. kvm unit-tests support for CoVE
> 12. Guest hardening
> 13. Port pKVM on RISC-V using CoVE
> 14. Any other ?
>
> Links
> ============
> [0] CoVE architecture Specification.
> https://github.com/riscv-non-isa/riscv-ap-tee/blob/main/specification/riscv-aptee-spec.pdf

I just noticed that this link is broken due to a recent PR merge. Here
is the updated link
https://github.com/riscv-non-isa/riscv-ap-tee/blob/main/specification/riscv-cove.pdf

Sorry for the noise.

> [1] https://lists.riscv.org/g/sig-hypervisors/message/260
> [2] https://github.com/riscv/riscv-aia/releases/download/1.0-RC2/riscv-interrupts-1.0-RC2.pdf
> [3] https://github.com/rivosinc/linux/tree/cove_integration_device_filtering1
> [4] https://github.com/intel/tdx/commits/guest-filter-upstream
> [5] https://lists.riscv.org/g/tech-ap-tee/message/83
> [6] https://github.com/rivosinc/cove/wiki/CoVE-KVM-RISCV64-on-QEMU
> [7] https://github.com/rivosinc/linux/commits/cove-integration
> [8] https://github.com/rivosinc/kvmtool/tree/cove-integration-03072023
>
> Atish Patra (33):
> RISC-V: KVM: Improve KVM error reporting to the user space
> RISC-V: KVM: Invoke aia_update with preempt disabled/irq enabled
> RISC-V: KVM: Add a helper function to get pgd size
> RISC-V: Add COVH SBI extensions definitions
> RISC-V: KVM: Implement COVH SBI extension
> RISC-V: KVM: Add a barebone CoVE implementation
> RISC-V: KVM: Add UABI to support static memory region attestation
> RISC-V: KVM: Add CoVE related nacl helpers
> RISC-V: KVM: Implement static memory region measurement
> RISC-V: KVM: Use the new VM IOCTL for measuring pages
> RISC-V: KVM: Exit to the user space for trap redirection
> RISC-V: KVM: Return early for gstage modifications
> RISC-V: KVM: Skip dirty logging updates for TVM
> RISC-V: KVM: Add a helper function to trigger fence ops
> RISC-V: KVM: Skip most VCPU requests for TVMs
> RISC-V : KVM: Skip vmid/hgatp management for TVMs
> RISC-V: KVM: Skip TLB management for TVMs
> RISC-V: KVM: Register memory regions as confidential for TVMs
> RISC-V: KVM: Add gstage mapping for TVMs
> RISC-V: KVM: Handle SBI call forward from the TSM
> RISC-V: KVM: Implement vcpu load/put functions for CoVE guests
> RISC-V: KVM: Wireup TVM world switch
> RISC-V: KVM: Skip HVIP update for TVMs
> RISC-V: KVM: Implement COVI SBI extension
> RISC-V: KVM: Add interrupt management functions for TVM
> RISC-V: KVM: Skip AIA CSR updates for TVMs
> RISC-V: KVM: Perform limited operations in hardware enable/disable
> RISC-V: KVM: Indicate no support user space emulated IRQCHIP
> RISC-V: KVM: Add AIA support for TVMs
> RISC-V: KVM: Hookup TVM VCPU init/destroy
> RISC-V: KVM: Initialize CoVE
> RISC-V: KVM: Add TVM init/destroy calls
> drivers/hvc: sbi: Disable HVC console for TVMs
>
> Rajnesh Kanwal (15):
> mm/vmalloc: Introduce arch hooks to notify ioremap/unmap changes
> RISC-V: KVM: Update timer functionality for TVMs.
> RISC-V: Add COVI extension definitions
> RISC-V: KVM: Read/write gprs from/to shmem in case of TVM VCPU.
> RISC-V: Add COVG SBI extension definitions
> RISC-V: Add CoVE guest config and helper functions
> RISC-V: Implement COVG SBI extension
> RISC-V: COVE: Add COVH invalidate, validate, promote, demote and
> remove APIs.
> RISC-V: KVM: Add host side support to handle COVG SBI calls.
> RISC-V: Allow host to inject any ext interrupt id to a CoVE guest.
> RISC-V: Add base memory encryption functions.
> RISC-V: Add cc_platform_has() for RISC-V for CoVE
> RISC-V: ioremap: Implement for arch specific ioremap hooks
> riscv/virtio: Have CoVE guests enforce restricted virtio memory
> access.
> RISC-V: Add shared bounce buffer to support DBCN for CoVE Guest.
>
> arch/riscv/Kbuild | 2 +
> arch/riscv/Kconfig | 27 +
> arch/riscv/cove/Makefile | 2 +
> arch/riscv/cove/core.c | 40 +
> arch/riscv/cove/cove_guest_sbi.c | 109 +++
> arch/riscv/include/asm/cove.h | 27 +
> arch/riscv/include/asm/covg_sbi.h | 38 +
> arch/riscv/include/asm/csr.h | 2 +
> arch/riscv/include/asm/kvm_cove.h | 206 +++++
> arch/riscv/include/asm/kvm_cove_sbi.h | 101 +++
> arch/riscv/include/asm/kvm_host.h | 10 +-
> arch/riscv/include/asm/kvm_vcpu_sbi.h | 3 +
> arch/riscv/include/asm/mem_encrypt.h | 26 +
> arch/riscv/include/asm/sbi.h | 107 +++
> arch/riscv/include/uapi/asm/kvm.h | 17 +
> arch/riscv/kernel/irq.c | 12 +
> arch/riscv/kernel/setup.c | 2 +
> arch/riscv/kvm/Makefile | 1 +
> arch/riscv/kvm/aia.c | 101 ++-
> arch/riscv/kvm/aia_device.c | 41 +-
> arch/riscv/kvm/aia_imsic.c | 127 ++-
> arch/riscv/kvm/cove.c | 1005 +++++++++++++++++++++++
> arch/riscv/kvm/cove_sbi.c | 490 +++++++++++
> arch/riscv/kvm/main.c | 30 +-
> arch/riscv/kvm/mmu.c | 45 +-
> arch/riscv/kvm/tlb.c | 11 +-
> arch/riscv/kvm/vcpu.c | 69 +-
> arch/riscv/kvm/vcpu_exit.c | 34 +-
> arch/riscv/kvm/vcpu_insn.c | 115 ++-
> arch/riscv/kvm/vcpu_sbi.c | 16 +
> arch/riscv/kvm/vcpu_sbi_covg.c | 232 ++++++
> arch/riscv/kvm/vcpu_timer.c | 26 +-
> arch/riscv/kvm/vm.c | 34 +-
> arch/riscv/kvm/vmid.c | 17 +-
> arch/riscv/mm/Makefile | 3 +
> arch/riscv/mm/init.c | 17 +-
> arch/riscv/mm/ioremap.c | 45 +
> arch/riscv/mm/mem_encrypt.c | 61 ++
> drivers/tty/hvc/hvc_riscv_sbi.c | 5 +
> drivers/tty/serial/earlycon-riscv-sbi.c | 51 +-
> include/uapi/linux/kvm.h | 8 +
> mm/vmalloc.c | 16 +
> 42 files changed, 3222 insertions(+), 109 deletions(-)
> create mode 100644 arch/riscv/cove/Makefile
> create mode 100644 arch/riscv/cove/core.c
> create mode 100644 arch/riscv/cove/cove_guest_sbi.c
> create mode 100644 arch/riscv/include/asm/cove.h
> create mode 100644 arch/riscv/include/asm/covg_sbi.h
> create mode 100644 arch/riscv/include/asm/kvm_cove.h
> create mode 100644 arch/riscv/include/asm/kvm_cove_sbi.h
> create mode 100644 arch/riscv/include/asm/mem_encrypt.h
> create mode 100644 arch/riscv/kvm/cove.c
> create mode 100644 arch/riscv/kvm/cove_sbi.c
> create mode 100644 arch/riscv/kvm/vcpu_sbi_covg.c
> create mode 100644 arch/riscv/mm/ioremap.c
> create mode 100644 arch/riscv/mm/mem_encrypt.c
>
> --
> 2.25.1
>


--
Regards,
Atish