[RFC PATCH 0/6] KVM: mm: fd-based approach for supporting KVM guest private memory

From: Chao Peng
Date: Thu Nov 11 2021 - 09:14:53 EST


This RFC series tries to implement the fd-based KVM guest private memory
proposal described at [1].

We have already had some offline discussions on this series, which resulted
in a different design proposal from Paolo. This thread includes both the
original RFC patch series for proposal [1] and a summary of the new
proposal from Paolo so that we can continue the discussion.

To understand the patches and the new proposal, it is highly recommended
to read the original proposal [1] first.


Patch Description
=================
The patches include a private memory implementation in the memfd/shmem
backing store and KVM support for private memory slots, as well as its
counterpart in QEMU.

Patch1: kernel part shmem/memfd support
Patch2-6: KVM part
Patch7-13: QEMU part

QEMU Usage:
-machine private-memory-backend=ram1 \
-object memory-backend-memfd,id=ram1,size=5G,guest_private=on,seal=off


New Proposal
============
Below is a summary of the changes for the new proposal that was discussed
in the offline thread.

In general, this new proposal reuses the concept of an fd-based guest
memory backing store described in [1], but uses a different way to
coordinate the private and shared parts in one single memslot instead
of introducing a dedicated private memslot.

- memslot extension
The new proposal suggests adding the private fd and the offset to the
existing 'shared' memslot so that both private and shared memory can live
in one single memslot. A page in the memslot is either private or shared.
A page is private only when it is allocated in the private fd; in all
other cases it is treated as shared. This includes pages already mapped
as shared as well as pages that have not been mapped yet.
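
To make the classification rule concrete, here is a minimal userspace
sketch in C. The struct demo_memslot, its field names and both helper
functions are hypothetical names chosen for illustration; they are not the
layout used by the patches. The allocation state of the private fd is
modelled with a plain bool array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_MAX_PAGES  16

/* Hypothetical model of one memslot that carries both backings. */
struct demo_memslot {
    uint64_t base_gfn;        /* first guest frame number in the slot */
    uint64_t npages;          /* slot size in pages (<= DEMO_MAX_PAGES) */
    uint64_t userspace_addr;  /* existing 'shared' backing */
    int      private_fd;      /* new: fd of the private backing store */
    uint64_t private_offset;  /* new: offset of the slot inside private_fd */
    bool     allocated[DEMO_MAX_PAGES]; /* stand-in for "allocated in private_fd" */
};

/* Byte offset inside the private fd that backs the given gfn. */
static uint64_t demo_private_file_offset(const struct demo_memslot *slot,
                                         uint64_t gfn)
{
    assert(gfn >= slot->base_gfn && gfn - slot->base_gfn < slot->npages);
    return slot->private_offset +
           ((gfn - slot->base_gfn) << DEMO_PAGE_SHIFT);
}

/* Private only if allocated in the private fd; everything else is shared. */
static bool demo_page_is_private(const struct demo_memslot *slot, uint64_t gfn)
{
    assert(gfn >= slot->base_gfn && gfn - slot->base_gfn < slot->npages);
    return slot->allocated[gfn - slot->base_gfn];
}
```

Note that "never mapped" and "mapped as shared" are deliberately not
distinguished in this model: both fall out as "not private".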

- private memory map/unmap
Userspace performs the map/unmap operations by calling fallocate() on the
private fd.
- map: default fallocate() with mode=0.
- unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
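
From the userspace side, the two operations could look roughly like the
sketch below. It uses an ordinary memfd without any guest-private sealing,
so it only demonstrates the fallocate() calls themselves; the demo_*
function names are made up for this example. Per fallocate(2),
FALLOC_FL_PUNCH_HOLE has to be combined with FALLOC_FL_KEEP_SIZE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <sys/mman.h>   /* memfd_create() */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an ordinary memfd of the given size (no guest-private seal). */
static int demo_create_backing(off_t size)
{
    int fd = memfd_create("demo-private", MFD_CLOEXEC);

    if (fd >= 0 && ftruncate(fd, size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* "map": allocate backing pages for [offset, offset + len). */
static int demo_private_map(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    return fallocate(fd, 0, offset, len);
}

/* "unmap": free the backing pages while keeping the file size. */
static int demo_private_unmap(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/* Number of 512-byte blocks currently allocated to the file, or -1. */
static long long demo_allocated_blocks(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_blocks;
}
```

Offsets and lengths should be page aligned; a hole that covers only part
of a page zeroes that part instead of freeing the page.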

There would be two new callbacks registered by KVM and called by the memory
backing store during the above map/unmap operations:
- map(inode, offset, size): the memory backing store tells the related KVM
memslot to do a shared->private conversion.
- unmap(inode, offset, size): the memory backing store tells the related KVM
memslot to do a private->shared conversion.

The memory backing store also needs to provide a new callback for KVM to
query whether a page is already allocated in the private fd, so that KVM
can know whether the page is private or not.
- page_allocated(inode, offset): for shmem this would simply return
pagecache_get_page().
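
The interplay of the three callbacks can be modelled in plain C as below.
This is a userspace simulation under stated assumptions, not kernel code:
all demo_* types and functions are hypothetical, the inode is a small
array of page states, and the order in which the backing store updates its
own state and calls into KVM is a guess made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u
#define DEMO_NPAGES    8u

enum demo_state { DEMO_SHARED = 0, DEMO_PRIVATE = 1 };

struct demo_inode;

/* Callbacks that KVM would register with the backing store. */
struct demo_guest_ops {
    void (*map)(struct demo_inode *inode, uint64_t offset, uint64_t size);
    void (*unmap)(struct demo_inode *inode, uint64_t offset, uint64_t size);
};

/* Hypothetical stand-in for the inode behind the private fd. */
struct demo_inode {
    bool allocated[DEMO_NPAGES];            /* backing store: page present */
    enum demo_state kvm_view[DEMO_NPAGES];  /* KVM: how the page is mapped */
    const struct demo_guest_ops *ops;
};

static void demo_set_view(struct demo_inode *inode, uint64_t offset,
                          uint64_t size, enum demo_state state)
{
    assert(offset % DEMO_PAGE_SIZE == 0 && size % DEMO_PAGE_SIZE == 0);
    assert((offset + size) / DEMO_PAGE_SIZE <= DEMO_NPAGES);
    for (uint64_t i = offset / DEMO_PAGE_SIZE;
         i < (offset + size) / DEMO_PAGE_SIZE; i++)
        inode->kvm_view[i] = state;
}

/* KVM side: shared->private and private->shared conversion. */
static void demo_kvm_map(struct demo_inode *inode, uint64_t offset, uint64_t size)
{
    demo_set_view(inode, offset, size, DEMO_PRIVATE);
}

static void demo_kvm_unmap(struct demo_inode *inode, uint64_t offset, uint64_t size)
{
    demo_set_view(inode, offset, size, DEMO_SHARED);
}

static const struct demo_guest_ops demo_kvm_ops = {
    .map = demo_kvm_map,
    .unmap = demo_kvm_unmap,
};

/* Backing store side: what fallocate() on the private fd would trigger. */
static void demo_store_fallocate(struct demo_inode *inode, bool punch_hole,
                                 uint64_t offset, uint64_t size)
{
    assert(offset % DEMO_PAGE_SIZE == 0 && size % DEMO_PAGE_SIZE == 0);
    assert((offset + size) / DEMO_PAGE_SIZE <= DEMO_NPAGES);
    if (punch_hole)
        inode->ops->unmap(inode, offset, size);  /* drop KVM mapping first */
    for (uint64_t i = offset / DEMO_PAGE_SIZE;
         i < (offset + size) / DEMO_PAGE_SIZE; i++)
        inode->allocated[i] = !punch_hole;
    if (!punch_hole)
        inode->ops->map(inode, offset, size);    /* pages exist, now map */
}

/* Query used by KVM to decide whether a page is private. */
static bool demo_page_allocated(const struct demo_inode *inode, uint64_t offset)
{
    assert(offset / DEMO_PAGE_SIZE < DEMO_NPAGES);
    return inode->allocated[offset / DEMO_PAGE_SIZE];
}
```

A zero-initialized struct demo_inode with ops set to &demo_kvm_ops starts
out with every page shared, which matches the rule that unallocated pages
are treated as shared.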

There are two places in KVM that can exit to userspace to trigger a
private/shared conversion:
- explicit conversion: happens when the guest calls into KVM to explicitly
map a range (as private or shared). KVM then exits to userspace to do
the above map/unmap operations.
- implicit conversion: happens in the KVM page fault handler.
* If the fault is due to a private memory access, cause a userspace exit
for a shared->private conversion request when page_allocated() returns
false; otherwise map the page directly without a userspace exit.
* If the fault is due to a shared memory access, cause a userspace exit
for a private->shared conversion request when page_allocated() returns
true; otherwise map the page directly without a userspace exit.
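
The implicit-conversion rules reduce to a small decision table over two
inputs: the kind of access that faulted and the result of page_allocated().
The enum and function names below are hypothetical and exist only to spell
the table out.

```c
#include <stdbool.h>

enum demo_fault_action {
    DEMO_MAP_DIRECTLY,             /* no userspace exit needed */
    DEMO_EXIT_CONVERT_TO_PRIVATE,  /* exit: request shared->private */
    DEMO_EXIT_CONVERT_TO_SHARED,   /* exit: request private->shared */
};

/*
 * private_access: the fault was caused by a private memory access.
 * page_allocated: the page is currently allocated in the private fd.
 */
static enum demo_fault_action demo_handle_fault(bool private_access,
                                                bool page_allocated)
{
    if (private_access)
        return page_allocated ? DEMO_MAP_DIRECTLY
                              : DEMO_EXIT_CONVERT_TO_PRIVATE;

    return page_allocated ? DEMO_EXIT_CONVERT_TO_SHARED
                          : DEMO_MAP_DIRECTLY;
}
```

In other words, a userspace exit is needed exactly when the access type
and the current allocation state disagree.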

An example flow:

  guest                      Linux                 userspace
  -------------------------  --------------------  -----------------------
                                                   ioctl(KVM_RUN)
  access private memory
    '--- EPT violation --.
                         v
                             userspace exit
                               '------------------.
                                                  v
                                                   munmap shared memfd
                                                   fallocate private memfd
                               .------------------'
                               v
                             fallocate()
                               call guest_ops
                                 unmap shared PTE
                                 map private PTE
                             ...
                                                   ioctl(KVM_RUN)

Compared to the original proposal, the new proposal:
- does not need to introduce a KVM memslot hole punching API,
- would avoid potential memslot performance/scalability/fragmentation issues,
- may also reduce userspace complexity,
- but requires additional callbacks between KVM and the memory backing
store.

[1] https://lkml.kernel.org/kvm/51a6f74f-6c05-74b9-3fd7-b7cd900fb8cc@xxxxxxxxxx/t/

Thanks,
Chao
---
Chao Peng (6):
mm: Add F_SEAL_GUEST to shmem/memfd
kvm: x86: Introduce guest private memory address space to memslot
kvm: x86: add private_ops to memslot
kvm: x86: implement private_ops for memfd backing store
kvm: x86: add KVM_EXIT_MEMORY_ERROR exit
KVM: add KVM_SPLIT_MEMORY_REGION

Documentation/virt/kvm/api.rst | 1 +
arch/x86/include/asm/kvm_host.h | 5 +-
arch/x86/include/uapi/asm/kvm.h | 4 +
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/memfd.c | 63 +++++++++++
arch/x86/kvm/mmu/mmu.c | 69 ++++++++++--
arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
arch/x86/kvm/x86.c | 3 +-
include/linux/kvm_host.h | 41 ++++++-
include/linux/memfd.h | 22 ++++
include/linux/shmem_fs.h | 9 ++
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/kvm.h | 34 ++++++
mm/memfd.c | 34 +++++-
mm/shmem.c | 127 +++++++++++++++++++++-
virt/kvm/kvm_main.c | 185 +++++++++++++++++++++++++++++++-
16 files changed, 581 insertions(+), 22 deletions(-)
create mode 100644 arch/x86/kvm/memfd.c

--
2.17.1