Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

From: Sean Christopherson
Date: Fri Jul 14 2023 - 15:29:31 EST


On Thu, Jul 13, 2023, Ackerley Tng wrote:
> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>
> > On Fri, May 05, 2023, Ackerley Tng wrote:
> >>
> >> Hi Sean,
> >>
> >> Thanks for implementing this POC!
> >>
> >> ... snip ...
> >>
> >
> > I don't love either approach, because it means a file created in the context
> > of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> > that it can't do anything with except close(). I doubt that matters in practice
> > though, e.g. when the VM dies, all memory can be freed so that the file ends up
> > being little more than a shell. And if we go that route, there's no need to grab
> > a reference to the file during bind, KVM can just grab a longterm reference when
> > the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
> >
> > ... snip ...
> >
> > My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
> > and not a common syscall. If the file isn't tightly coupled to a single VM, then
> > punching a hole is further complicated by needing to deal with invalidating multiple
> > regions that are bound to different @kvm instances. It's not super complex, but
> > AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
> > having one VM own the memory will complicate things even if/when we get to the
> > point where VMs can share "private" memory, and the gmem code would still need
> > to deal with grabbing a module reference.
>
> I’d like to follow up on this discussion about a guest_mem file
> outliving the VM and whether to have a VM-scoped ioctl or a KVM ioctl.
>
> Here's a POC of delayed binding of a guest_mem file to a memslot, where
> the guest_mem file outlives the VM [1].
>
> I also hope to raise some points before we do the first integration of
> guest_mem patches!
>
> A use case for guest_mem inodes outliving the VM is when the host VMM
> needs to be upgraded.

Translation: to support intra-host migration, a.k.a. KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM

> The guest_mem inode is passed between two VMs on the same host machine and
> all memory associated with the inode needs to be retained.
>
> To support the above use case, binding of memslots is delayed until
> first use, so that the following inode passing flow can be used:
>
> 1. Source (old version of host VMM) process passes guest_mem inodes to
> destination (new version of host VMM) process via unix sockets.
> 2. Destination process initializes memslots identical to source process.
> 3. Destination process invokes ioctl to migrate guest_mem inode over to
> destination process by unbinding all memslots from the source VM and
> binding them to the destination VM. (The kvm pointer is updated in
> this process)
>
> Without delayed binding, step 2 will fail since initialization of
> memslots would check and find that the kvm pointer in the guest_mem
> inode points to the kvm in the source process.
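
FWIW, the destination-side flow could look something like the sketch below.
recv_fd() and setup_memslots() are stand-ins for VMM-specific plumbing, error
handling is elided, and I'm assuming the final step rides on the existing
KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM plumbing (per the "Translation" above):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* VMM-specific helpers, assumed to exist elsewhere in the VMM. */
extern int recv_fd(int unix_sock);
extern int setup_memslots(int vm_fd, int gmem_fd);

int migrate_in(int dst_vm_fd, int src_vm_fd, int unix_sock)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM,
                .args[0] = src_vm_fd,
        };
        int gmem_fd;

        /* 1. Receive the guest_mem fd(s) from the old VMM over the socket. */
        gmem_fd = recv_fd(unix_sock);

        /* 2. Recreate memslots identical to the source's, referencing gmem_fd. */
        if (setup_memslots(dst_vm_fd, gmem_fd))
                return -1;

        /* 3. Move the source VM's context (and gmem bindings) to the new VM. */
        return ioctl(dst_vm_fd, KVM_ENABLE_CAP, &cap);
}
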
>
>
> These two patches contain the meat of the changes required to support
> delayed binding:
>
> https://github.com/googleprodkernel/linux-cc/commit/93b31a006ef2e4dbe1ef0ec5d2534ca30f3bf60c
> https://github.com/googleprodkernel/linux-cc/commit/dd5ac5e53f14a1ef9915c9c1e4cc1006a40b49df
>
> Some things to highlight for the approach set out in these two patches:
>
> 1. Previously, closing the guest_mem file in userspace was taken to mean
> that all associated memory is to be removed and cleared. With these
> two patches, each memslot also holds a reference to the file (and
> therefore inode) and so even if the host VMM closes the fd, the VM
> will be able to continue to function.
>
> This is desirable to userspace since closing the file should not be
> interpreted as a command to clear memory.

100% agreed. However, after more thought since we first discussed this, I think
that a deferred binding is the wrong way to solve this particular problem. More
below.

> This is aligned with the
> way tmpfs files are used with KVM before guest_mem: when the file is
> closed in userspace, the memory contents are still mapped and can
> still be used by the VM. fallocate(PUNCH_HOLE) is how userspace
> should command memory to be removed, just like munmap() would be used
> to remove memory from use by KVM.
>
> 2. Creating a guest mem file no longer depends on a specific VM and
> hence the guest_mem creation ioctl can be a system ioctl instead of a
> VM specific ioctl. This will also address Chao's concern at [3].

That concern is a non-issue for QEMU as memory backends are created after
accelerators are initialized, and AFAICT Google's VMM behaves similarly.

And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making
the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace
to play nice with the required ordering. If userspace can get at /dev/kvm, then
it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e.
the only dependencies for KVM_CREATE_VM should be known/resolved long before the
VMM knows it wants to use gmem.
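
Concretely, the ordering a VM-scoped ioctl() imposes on userspace is trivial
to satisfy.  A minimal sketch, where KVM_CREATE_GUEST_MEMFD is just a
placeholder name/number for whatever the creation ioctl() ends up being, and
error handling is elided:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder name and number, invented purely for illustration. */
#define KVM_CREATE_GUEST_MEMFD  _IO(KVMIO, 0xff)

int create_vm_and_gmem(unsigned long vm_type, void *gmem_args)
{
        int kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, vm_type);

        /*
         * The only input to KVM_CREATE_VM is the type, so the VM can always
         * be created long before the VMM knows it wants gmem.
         */
        return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, gmem_args);
}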

Using a non-KVM syscall would eliminate any such dependencies, but in my very
strong opinion, that is not a good reason to go with a syscall.

> I also separated cleaning up files vs inodes in
> https://github.com/googleprodkernel/linux-cc/commit/0f5aa18910c515141e57e05c4cc791022047a242,
> which I believe is more aligned with how files and inodes are cleaned up
> in FSes.

I didn't take these, though I am in the process of incorporating parts of the
underlying feedback (see below).

> This alignment makes it easier to extend gmem to hugetlb, for one.

I am not convinced that utilizing hugetlb is the best way to provide 1GiB support
in gmem. I'm not necessarily against it, but there's a fair bit of uncertainty
around the future of hugetlb, and there are fundamental aspects of hugetlb that
may be non-goals for the gmem use case, e.g. mapping the memory with 1GiB pages
in the userspace page tables likely isn't necessary, and might even be
undesirable.

And practically speaking, the changes aren't _that_ invasive, i.e. punting any
necessary refactoring should not substantially increase the size/complexity of
hugetlb support (*if* we end up adding it).

That said, I do think we'll implement .evict_inode() very early on in order to
support SNP and TDX, because (in addition to PUNCH_HOLE) that's when the backing
memory will be freed and thus reclaimed, i.e. unassigned in the RMP (SNP) / zeroed
with the shared key ID (TDX).
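
A rough sketch of what that might look like, where kvm_arch_gmem_invalidate()
is a made-up name for whatever arch hook ends up doing the RMP unassignment /
key ID scrubbing:

static void kvm_gmem_evict_inode(struct inode *inode)
{
        /*
         * Hypothetical arch hook to reclaim the backing memory, e.g. unassign
         * pages in the RMP (SNP) or scrub them with the shared key ID (TDX).
         */
        kvm_arch_gmem_invalidate(inode);

        truncate_inode_pages_final(inode->i_mapping);
        clear_inode(inode);
}

static const struct super_operations kvm_gmem_super_ops = {
        .evict_inode    = kvm_gmem_evict_inode,
};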

> While working on this, I was also wondering if we should perhaps be
> storing the inode pointer in slot->gmem instead of the file pointer? The
> memory is associated with an inode->mapping rather than the file. Are we
> binding to a userspace handle on the inode (store file pointer), or are
> we really referencing the inode (store inode pointer)?

Conceptually, I think KVM should bind to the file. The inode is effectively
the raw underlying physical storage, while the file is the VM's view of that
storage.

Practically, I think that gives us a clean, intuitive way to handle intra-host
migration. Rather than transfer ownership of the file, instantiate a new file
for the target VM, using the gmem inode from the source VM, i.e. create a hard
link. That'd probably require new uAPI, but I don't think that will be hugely
problematic. KVM would need to ensure the new VM's guest_memfd can't be mapped
until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM completes (which would also need to
verify the memslots/bindings are identical), but that should be easy enough to
enforce.
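
From userspace's perspective, the new uAPI could be as simple as the sketch
below.  The ioctl() name, number, and struct layout are all hypothetical,
purely to illustrate the "hard link" idea:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical uAPI; everything here is invented for illustration only. */
struct kvm_link_guest_memfd {
        __u64 fd;       /* guest_memfd from the source VM */
        __u64 flags;
};
#define KVM_LINK_GUEST_MEMFD    _IOWR(KVMIO, 0xff, struct kvm_link_guest_memfd)

/*
 * Issued against the *destination* VM; returns a new guest_memfd backed by
 * the same inode (i.e. the same physical memory) as the source VM's file.
 */
int link_gmem(int dst_vm_fd, int src_gmem_fd)
{
        struct kvm_link_guest_memfd link = {
                .fd = src_gmem_fd,
        };

        return ioctl(dst_vm_fd, KVM_LINK_GUEST_MEMFD, &link);
}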

That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
the memory and the *contents* of memory to outlive the VM, i.e. be effectively
transferred to the new target VM. And we'll maintain the invariant that each
guest_memfd is bound 1:1 with a single VM.

As above, that should also help us draw the line between mapping memory into a
VM (file), and freeing/reclaiming the memory (inode).

There will be extra complexity/overhead as we'll have to play nice with the
possibility of multiple files per inode, e.g. to zap mappings across all files
when punching a hole, but the incremental complexity should be quite small,
e.g. we can use address_space.private_list to keep track of the guest_memfd
instances associated with the inode.
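
E.g. a very rough sketch of the PUNCH_HOLE path, where struct kvm_gmem and
the invalidate helpers are placeholders for whatever the per-file bookkeeping
ends up looking like (and the locking is simplified):

/* Per-file bookkeeping; the layout here is purely illustrative. */
struct kvm_gmem {
        struct kvm *kvm;
        struct xarray bindings;
        struct list_head entry;         /* on inode->i_mapping->private_list */
};

static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
        struct address_space *mapping = inode->i_mapping;
        pgoff_t start = offset >> PAGE_SHIFT;
        pgoff_t end = (offset + len) >> PAGE_SHIFT;
        struct kvm_gmem *gmem;

        filemap_invalidate_lock(mapping);

        /* Zap SPTEs in every VM that has a view of this inode... */
        list_for_each_entry(gmem, &mapping->private_list, entry)
                kvm_gmem_invalidate_begin(gmem, start, end);

        /* ...then free the backing pages themselves. */
        truncate_inode_pages_range(mapping, offset, offset + len - 1);

        list_for_each_entry(gmem, &mapping->private_list, entry)
                kvm_gmem_invalidate_end(gmem, start, end);

        filemap_invalidate_unlock(mapping);

        return 0;
}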

Setting aside TDX and SNP for the moment, as it's not clear how they'll support
memory that is "private" but shared between multiple VMs, I think per-VM files
would work well for sharing gmem between two VMs. E.g. it would allow a given
page to be bound to a different gfn for each VM, and would allow different
permissions for each file (e.g. to allow fallocate() only from the original owner).