Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM)

From: Sean Christopherson
Date: Mon May 15 2023 - 19:47:07 EST


On Fri, May 05, 2023, Sean Christopherson wrote:
> On Fri, May 05, 2023, Ackerley Tng wrote:
> > One issue I’ve found so far is that the pointer to kvm (gmem->kvm) is
> > not cleaned up, and hence it is possible to crash the host kernel in the
> > following way
> >
> > 1. Create a KVM VM
> > 2. Create a guest mem fd on that VM
> > 3. Create a memslot with the guest mem fd (hence binding the fd to the
> > VM)
> > 4. Close/destroy the KVM VM
> > 5. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> > when it tries to do invalidation.
> >
> > I then tried to clean up the gmem->kvm pointer during unbinding when the
> > KVM VM is destroyed.
> >
> > That works, but then I realized there’s a simpler way to trigger the
> > use-after-free:
> >
> > 1. Create a KVM VM
> > 2. Create a guest mem fd on that VM
> > 3. Close/destroy the KVM VM
> > 4. Call fallocate(PUNCH_HOLE) on the guest mem fd, which uses gmem->kvm
> > when it tries to do invalidation.
> >
> > Perhaps binding should mean setting the gmem->kvm pointer in addition to
> > gmem->bindings. This makes binding and unbinding symmetric and avoids
> > the use-after-frees described above.
>
> Hrm, that would work, though it's a bit convoluted, e.g. would require detecting
> when the last binding is being removed. A similar (also ugly) solution would be
> to nullify gmem->kvm when KVM dies.
>
> I don't love either approach because it means a file created in the context
> of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> that it can't do anything with except close(). I doubt that matters in practice
> though, e.g. when the VM dies, all memory can be freed so that the file ends up
> being little more than a shell. And if we go that route, there's no need to grab
> a reference to the file during bind, KVM can just grab a longterm reference when
> the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
>
> Blech, another wart is that I believe gmem would need to do __module_get() during
> file creation to prevent kvm.ko from being unloaded after the last VM dies. Ah,
> but that'd also be true if we went with a system-scoped KVM ioctl(), so I suppose
> it's not _that_ ugly.
>
> Exchanging references (at binding or at creation) doesn't work, because that
> creates a circular dependency, i.e. gmem and KVM would pin each other.
>
> A "proper" refcounting approach, where the file pins KVM and not vice versa, gets
> nasty because of how KVM's memslots work. The least awful approach I can think of
> would be to delete the associated memslot(s) when the file is released, possibly
> via deferred work to avoid deadlock issues. Not the prettiest thing ever and in
> some ways that'd yield an even worse ABI.

Circling back to this. Pending testing, the "proper" refcounting approach seems
to be the way to go. KVM's existing memslots actually work this way, e.g. if
userspace does munmap() on an active memslot, KVM zaps any PTEs but the memslot
stays alive. A similar approach can be taken here; the only wrinkle is that the
association between gmem and the memslot is stronger than the association
between VMAs and memslots, specifically because KVM binds the file and not
simply the file descriptor. Binding to an exact file is necessary because
otherwise userspace could install a different file at the same file descriptor.
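
E.g. if the memslot captured only the file descriptor, nothing would prevent
userspace from doing something like this after binding (purely illustrative
userspace snippet, with gmem_fd being the bound descriptor):

        int other = open("/dev/null", O_RDWR);

        /* The descriptor KVM recorded now resolves to /dev/null, not gmem. */
        dup2(other, gmem_fd);

at which point a by-fd lookup in KVM would yield a completely unrelated file.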

The stronger binding is still workable without having to delete memslots
though, e.g. by protecting the file pointer in the memslot with RCU and bumping
the file's refcount directly in the two places where KVM needs to get at gmem
(the file) via the memslot: unbind and faulting in a page. E.g.

static struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
{
        struct file *file;

        rcu_read_lock();

        file = rcu_dereference(slot->gmem.file);
        if (file && !get_file_rcu(file))
                file = NULL;

        rcu_read_unlock();

        return file;
}
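
and then, e.g., the faulting side pairs that with fput() once it's done. Very
rough sketch (kvm_gmem_get_pfn() and slot->gmem.pgoff are purely illustrative
names, and the folio lifetime/uptodate handling is hand-waved):

static int kvm_gmem_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
                            kvm_pfn_t *pfn)
{
        pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
        struct folio *folio;
        struct file *file;

        /* NULL if the file has been, or is being, released. */
        file = kvm_gmem_get_file(slot);
        if (!file)
                return -EFAULT;

        /* Grab (allocating if necessary) the folio backing the gfn. */
        folio = filemap_grab_folio(file_inode(file)->i_mapping, index);
        if (IS_ERR_OR_NULL(folio)) {
                fput(file);
                return -ENOMEM;
        }

        *pfn = folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));

        /*
         * Keep the folio reference from filemap_grab_folio() until the caller
         * has installed the pfn; the file reference isn't needed beyond this
         * point.
         */
        folio_unlock(folio);
        fput(file);
        return 0;
}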

The gotcha is that ->release could race with memslot deletion, as kvm_gmem_unbind()
won't be able to differentiate between "file was deleted" and "file is currently
being deleted". That's easy enough to deal with though: kvm_gmem_release() can
take slots_lock to prevent the memslot from going away while release nullifies
the bindings and invalidates ranges for the memslot.
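
I.e. something like the below (rough sketch; treating gmem->bindings as an
xarray of bound memslots and approximating the invalidate helpers' signatures
purely for illustration):

static int kvm_gmem_release(struct inode *inode, struct file *file)
{
        struct kvm_gmem *gmem = file->private_data;
        struct kvm_memory_slot *slot;
        struct kvm *kvm = gmem->kvm;
        unsigned long index;

        /*
         * Take slots_lock so that memslot deletion, i.e. kvm_gmem_unbind(),
         * can't run concurrently and free the memslot out from under release.
         */
        mutex_lock(&kvm->slots_lock);

        /* Sever the bindings so that kvm_gmem_get_file() returns NULL. */
        xa_for_each(&gmem->bindings, index, slot)
                rcu_assign_pointer(slot->gmem.file, NULL);

        synchronize_rcu();

        /* Zap and invalidate all ranges that were bound to this file. */
        kvm_gmem_invalidate_begin(gmem, 0, -1ul);
        kvm_gmem_invalidate_end(gmem, 0, -1ul);

        mutex_unlock(&kvm->slots_lock);

        xa_destroy(&gmem->bindings);

        /* The file pins KVM, not vice versa, so drop that reference last. */
        kvm_put_kvm(kvm);
        return 0;
}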

> Side topic, there's a second bug (and probably more lurking): kvm_swap_active_memslots()'s
> call to synchronize_srcu_expedited() is done _before_ the call to kvm_gmem_unbind(),
> i.e. doesn't wait for readers in kvm_gmem_invalidate_begin() to go away. The easy
> solution for that one is to add another synchronize_srcu_expedited() after unbinding.

There's a bug here, but not the one I pointed out. Acquiring kvm->srcu doesn't
provide any protection: the binding already has a pointer to the memslot, i.e.
it isn't doing an SRCU-protected lookup in the memslots. The actual protection
is provided by the filemap invalidate lock, which prevents unbinding a memslot
until all in-flight invalidations complete, i.e. acquiring kvm->srcu in the
punch hole path is completely unnecessary.
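
I.e. the punch hole path only needs the invalidate lock; very roughly (helper
names/signatures again approximated, error handling omitted):

static long kvm_gmem_punch_hole(struct file *file, loff_t offset, loff_t len)
{
        struct kvm_gmem *gmem = file->private_data;
        struct address_space *mapping = file_inode(file)->i_mapping;
        pgoff_t start = offset >> PAGE_SHIFT;
        pgoff_t end = (offset + len) >> PAGE_SHIFT;

        /*
         * The invalidate lock, not kvm->srcu, is what keeps the bindings
         * stable: kvm_gmem_unbind() can't finish until the invalidation
         * completes.
         */
        filemap_invalidate_lock(mapping);

        kvm_gmem_invalidate_begin(gmem, start, end);

        truncate_inode_pages_range(mapping, offset, offset + len - 1);

        kvm_gmem_invalidate_end(gmem, start, end);

        filemap_invalidate_unlock(mapping);

        return 0;
}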