Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

From: Sean Christopherson
Date: Fri May 05 2023 - 20:55:52 EST


On Fri, May 05, 2023, Ackerley Tng wrote:
>
> Hi Sean,
>
> Thanks for implementing this POC!
>
> I’ve started porting the selftests (both Chao’s and those I added [1]).
>
> guest mem seems to cover the use cases that have been discussed and
> proposed so far, but I still need to figure out how gmem can work with
>
> + hugetlbfs
> + specification of/storing memory policy (for NUMA node bindings)
> + memory accounting - we may need to account for guest mem usage
> separately, so that it shows up on its own in /proc/meminfo and similar
> places.
>
> One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
> not cleaned up, and hence it is possible to crash the host kernel in the
> following way:
>
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Create a memslot with the guest mem fd (hence binding the fd to the
> VM)
> 4. Close/destroy the KVM VM
> 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> when it tries to do invalidation.
>
> I then tried to clean up the gmem->kvm pointer during unbinding when the
> KVM VM is destroyed.
>
> That works, but then I realized there’s a simpler way to trigger the
> use-after-free:
>
> 1. Create a KVM VM
> 2. Create a guest mem fd on that VM
> 3. Close/destroy the KVM VM
> 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> when it tries to do invalidation.
>
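
FWIW, the simpler sequence boils down to a tiny reproducer, e.g. (untested
sketch, error handling and headers omitted; the uAPI names are the POC's and
may well change):

	int kvm_fd = open("/dev/kvm", O_RDWR);
	int vm_fd  = ioctl(kvm_fd, KVM_CREATE_VM, 0);

	struct kvm_create_guest_memfd args = {
		.size	= 0x200000,
		.flags	= 0,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);

	close(vm_fd);		/* last reference, kvm_destroy_vm() runs */

	/* gmem->kvm now points at a freed struct kvm => use-after-free */
	fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  0, 0x200000);
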
> Perhaps binding should mean setting the gmem->kvm pointer in addition to
> gmem->bindings. This makes binding and unbinding symmetric and avoids
> the use-after-frees described above.

Hrm, that would work, though it's a bit convoluted, e.g. would require detecting
when the last binding is being removed. A similar (also ugly) solution would be
to nullify gmem->kvm when KVM dies.
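
E.g. the invalidation side could simply bail once the VM is gone, paired with
kvm_destroy_vm() (or the final unbind) writing NULL into gmem->kvm; hypothetical
sketch with the locking/ordering hand-waved:

	/* In kvm_gmem_invalidate_begin()/end(): */
	struct kvm *kvm = READ_ONCE(gmem->kvm);

	if (!kvm)	/* VM is dead, all bindings are already gone */
		return;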

I don't love either approach because it means a file created in the context
of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
that it can't do anything with except close(). I doubt that matters in practice
though, e.g. when the VM dies, all memory can be freed so that the file ends up
being little more than a shell. And if we go that route, there's no need to grab
a reference to the file during bind, KVM can just grab a longterm reference when
the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
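
Roughly (hypothetical sketch, error handling and locking omitted; the
gmem_files/entry/file bookkeeping is invented purely for illustration):

	/* At file creation, the VM takes the longterm reference: */
	get_file(file);
	list_add(&gmem->entry, &kvm->gmem_files);

	/* In kvm_destroy_vm(), after the memslots are gone: */
	struct kvm_gmem *gmem, *tmp;

	list_for_each_entry_safe(gmem, tmp, &kvm->gmem_files, entry) {
		WRITE_ONCE(gmem->kvm, NULL);
		list_del(&gmem->entry);
		fput(gmem->file);
	}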

Blech, another wart is that I believe gmem would need to do __module_get() during
file creation to prevent kvm.ko from being unloaded after the last VM dies. Ah,
but that'd also be true if we went with a system-scoped KVM ioctl(), so I suppose
it's not _that_ ugly.
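
I.e. (sketch, function names approximate; the pairing lives in the file's
->release()):

	/* kvm_gmem_create(): keep kvm.ko alive for as long as the file is. */
	__module_get(THIS_MODULE);

	/* kvm_gmem_release(): */
	module_put(THIS_MODULE);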

Exchanging references (at binding or at creation) doesn't work, because that
creates a circular dependency, i.e. gmem and KVM would pin each other.

A "proper" refcounting approach, where the file pins KVM and not vice versa, gets
nasty because of how KVM's memslots work. The least awful approach I can think of
would be to delete the associated memslot(s) when the file is released, possibly
via deferred work to avoid deadlock issues. Not the prettiest thing ever and in
some ways that'd yield an even worse ABI.
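
For the record, the least-awful version of that would look something like the
below (very rough sketch: helper/field names are invented, and the locking and
lifetime details are hand-waved; release_work would be INIT_WORK()'d at file
creation):

	static void kvm_gmem_release_work(struct work_struct *work)
	{
		struct kvm_gmem *gmem = container_of(work, struct kvm_gmem,
						     release_work);
		struct kvm *kvm = gmem->kvm;
		struct kvm_memory_slot *slot;
		unsigned long index;

		mutex_lock(&kvm->slots_lock);
		xa_for_each(&gmem->bindings, index, slot)
			kvm_gmem_delete_memslot(kvm, slot);	/* invented helper */
		mutex_unlock(&kvm->slots_lock);

		kvm_put_kvm(kvm);	/* the reference that lets the file pin KVM */
		kfree(gmem);
	}

	static int kvm_gmem_release(struct inode *inode, struct file *file)
	{
		struct kvm_gmem *gmem = file->private_data;

		/* Deleting memslots directly from ->release() risks deadlock. */
		schedule_work(&gmem->release_work);
		return 0;
	}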

Side topic, there's a second bug (and probably more lurking): kvm_swap_active_memslots()'s
call to synchronize_srcu_expedited() is done _before_ the call to kvm_gmem_unbind(),
i.e. doesn't wait for readers in kvm_gmem_invalidate_begin() to go away. The easy
solution for that one is to add another synchronize_srcu_expedited() after unbinding.
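
I.e. in the memslot deletion path, the ordering would become (sketch of the
idea, not an actual diff against the POC):

	kvm_swap_active_memslots(kvm, as_id);	/* synchronize_srcu_expedited() */
	...
	kvm_gmem_unbind(slot);

	/*
	 * Readers that entered kvm_gmem_invalidate_begin() before the unbind
	 * may still be dereferencing the old binding; wait for them too.
	 */
	synchronize_srcu_expedited(&kvm->srcu);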

> This also means that creating a guest mem fd is no longer dependent on
> the VM. Perhaps we can make creating a gmem fd a system ioctl (like
> KVM_GET_API_VERSION and KVM_CREATE_VM) instead of a vm ioctl?

My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
and not a common syscall. If the file isn't tightly coupled to a single VM, then
punching a hole is further complicated by needing to deal with invalidating multiple
regions that are bound to different @kvm instances. It's not super complex, but
AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
having one VM own the memory will complicate things even if/when we get to the point where
VMs can share "private" memory, and the gmem code would still need to deal with
grabbing a module reference.
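
I.e. with the fd tied to a single VM, the PUNCH_HOLE path only ever has one
@kvm to invalidate; hypothetical sketch (names and details are approximations,
not the POC's actual code):

	static long kvm_gmem_punch_hole(struct file *file, loff_t offset,
					loff_t len)
	{
		struct kvm_gmem *gmem = file->private_data;
		pgoff_t start = offset >> PAGE_SHIFT;
		pgoff_t end = (offset + len) >> PAGE_SHIFT;

		/* Unmap and invalidate for the one-and-only owning VM. */
		kvm_gmem_invalidate_begin(gmem, start, end);
		truncate_inode_pages_range(file->f_mapping, offset,
					   offset + len - 1);
		kvm_gmem_invalidate_end(gmem, start, end);

		return 0;
	}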