Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates

From: Sean Christopherson
Date: Thu Sep 29 2022 - 17:39:18 EST


On Thu, Sep 29, 2022, Paolo Bonzini wrote:
>
> On 9/28/22 22:41, Sean Christopherson wrote:
> If you think we should overhaul it even more than just providing atomic
> batched updates, that's fine.

Or maybe solve the problem(s) with more precision? What I don't like about the
batching is that it's not "atomic", and AFAICT doesn't have line of sight to truly
becoming atomic. When I hear "atomic" updates, I think "all transactions are
commited at once". That is not what this proposed API does, and due to the TLB
flushing requirements, I don't see how it can be implemented.

> But still, the impossibility to perform
> atomic updates in batches *is* a suboptimal part of the KVM API.

I don't disagree, but I honestly don't see the difference between "batching" and
providing a KVM_MEM_USER_EXIT flag. The batch API will provide better performance,
but I would be surprised if it's significant enough to matter. I'm not saying
that KVM_MEM_USER_EXIT provides novel functionality, e.g. I think we're in
agreement that handling memslot metadata will suck regardless. My point is that
it's simpler to implement in KVM, much easier to document, and for all intents and
purposes provides the same desired functionality: vCPUs that access in-flux memslots
are stalled and so userspace doesn't have to pause and rendezvous all vCPUs to
provide "atomic" updates.

> The main cases are:
>
> - for the boot case, splitting and merging existing memslots. QEMU likes to
> merge adjacent memory regions into a single memslot, so if something goes
> from read-write to read-only it has to be split and vice versa. I guess a
> "don't merge this memory region" flag would be the less hideous way to solve
> it in userspace.
>
> - however, there is also the case of resizing an existing memslot which is
> what David would like to have for virtio-mem. This is not really fixable
> because part of the appeal of virtio-mem is to have a single huge memslot
> instead of many smaller ones, in order to reduce the granularity of
> add/remove (David, correct me if I'm wrong).
>
> (The latter _would_ be needed by other VMMs).

If we really want to provide a better experience for userspace, why not provide
more primitives to address those specific use cases? E.g. fix KVM's historic wart
of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:

- Merge 2+ memory regions that are contiguous in GPA and HVA
- Split a memory region into 2+ memory regions
- Truncate a memory region
- Grow a memory region

That would probably require more KVM code overall, but each operation would be more
tightly bounded and thus simpler to define. And I think more precise APIs would
provide other benefits, e.g. growing a region wouldn't require first deleting the
current region, and so could avoid zapping SPTEs and destroying metadata. Merge,
split, and truncate would have similar benefits, though they might be more
difficult to realize in practice.

What I really don't like about the "batch" API, aside from the "atomic" name, is
that it's too open ended and creates conflicts. E.g. to allow "atomic" merging
after converting from RO=>RW, KVM needs to allow DELETE=>DELETE=>CREATE, and that
implies that each operation in the batch operates on the working state created by
the previous operations.

But if userspace provides a batch that does DELETE=>CREATE=>DELETE, naively hoisting
all "invalidations" to the front half of the "atomic" operations breaks the
ordering. And there's a ridiculous amount of potential weirdness with MOVE=>MOVE
operations.

KVM could solve those issues by disallowing MOVE and disallowing DELETE after
CREATE or FLAGS_ONLY, but then userspace is effectively restricted to operating
on a single contiguous chunk of memslots. At that point, it seems that providing
primitives to do each macro operation is an even better experience for userspace.

In other words, make memslots a bit more CISC ;-)