Re: [PATCH 08/11] KVM: implement multiple address spaces

From: Radim Krčmář
Date: Tue May 19 2015 - 14:50:51 EST


2015-05-19 18:19+0200, Paolo Bonzini:
> On 19/05/2015 15:32, Radim Krčmář wrote:
> > I'd prefer to decouple address spaces and slots. KVM_GET_DIRTY_LOG
> > could stay the same if we said that a slot can be in multiple address
> > spaces.
>
> Actually, my original plan was a third one. I wanted to have a fixed
> number of address spaces, where CPUs would use only the first ones (0
> for non-x86, 0+1 for x86). Then I would let userspace pick a "dirty log
> address space" which, in the case of QEMU, would simply use ram_addr_t
> as the address.
>
> However, that doesn't work because when marking a page dirty you need
> either a (slots, gfn) or a (memslot, relative address in memslot) pair.
> Given the gfn that the VCPU has faulted on, it would be very expensive
> to find the corresponding gfn in the "dirty log address space".

If I understand your aim correctly, we could go with just one dirty
bitmap in this way:
- have a dirty bitmap that covers the userspace memory of all slots
  that have dirty logging enabled
- store each slot's offset into this bitmap

It would be a framework that lets userspace query dirty pages in terms
of its own address space. (Changing slots would be costly, but I don't
think that happens very often.)
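
Roughly what I have in mind (a hypothetical sketch; dirty_offset and
global_dirty_bitmap are made-up names, not from the real code):

struct kvm_memory_slot {
        gfn_t base_gfn;
        unsigned long npages;
        unsigned long dirty_offset;     /* offset into the VM-wide bitmap */
        /* ... */
};

static void mark_page_dirty_in_slot(struct kvm *kvm,
                                    struct kvm_memory_slot *slot, gfn_t gfn)
{
        /* Slots that map the same userspace memory would share the same
         * dirty_offset, so there is a single source of truth. */
        set_bit_le(slot->dirty_offset + (gfn - slot->base_gfn),
                   kvm->global_dirty_bitmap);
}

KVM_GET_DIRTY_LOG for a slot would then just copy out that slot's
window of the global bitmap.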

> On the
> contrary, it's very easy if the VCPU can simply query its current memslots.

Yeah, userspace drew the short stick.
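
The VCPU side really is cheap, though; a rough sketch of why, using
kvm_vcpu_memslots() from this series (vcpu_mark_dirty is a made-up
name):

static void vcpu_mark_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
{
        struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
        struct kvm_memory_slot *slot = __gfn_to_memslot(slots, gfn);

        /* The faulting VCPU already knows which memslots it uses, so
         * this is a direct lookup with no reverse translation. */
        if (slot && slot->dirty_bitmap)
                set_bit_le(gfn - slot->base_gfn, slot->dirty_bitmap);
}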

> In your proposal, there is the problem of where to do the overlap check.
> You cannot do it in the slots because it messes up userspace seriously.
>
> QEMU for example wants to use as few slots as possible and merges
> consecutive slots; but what is mergeable in one address space need not
> be mergeable in another. And if you do it in the address spaces, you
> have not solved the problem of multiple dirty bitmaps pointing to the
> same userspace address.

The worst case is the same (two totally different mappings), but some
cases could make do with fewer slots. With hierarchy (more below), we
could even allow overlapping a large slot.

I didn't want to solve "multiple bitmaps for the same userspace memory"
because KVM_GET_DIRTY_LOG doesn't care about the dirtiness of userspace
memory (though that's probably what we use it for), but only about the
guest address space; KVM_GET_DIRTY_LOG just says which page was
modified in a slot :/
(If we wanted a dirty bitmap for userspace memory, it would be better
to use the solution at the top.)

Fewer slots just mean fewer worries ...

> > (Well, we could do that even now, by searching for slots that
> > differ only in address space id and pointing them to the same dirty
> > bitmap. This even makes sense, which is a sign of lacking design :)
>
> I thought about this, but ultimately it sounded like premature
> optimization (and also might not even work due to the aforementioned
> problem with merging of adjacent slots).
>
> A possible optimization is to set a flag when no bits are set in the
> dirty bitmap, and skip the iteration. This is the common case for the
> SMM memory space for example. But it can be done later, and is an
> independent work.

True. (I'm not a huge fan of optimizations through exceptions.)
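
I imagine something like this hypothetical check at the top of the
dirty-log path (dirty_bitmap_touched is a made-up flag):

        unsigned long n = ALIGN(slot->npages, BITS_PER_LONG) / 8;

        /* Set when the slot is first dirtied after a KVM_GET_DIRTY_LOG;
         * if it's still clear, we can skip the whole bitmap walk. */
        if (!slot->dirty_bitmap_touched)
                return clear_user(log->dirty_bitmap, n) ? -EFAULT : 0;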

> Keeping slots separate for different address spaces also makes the most
> sense in QEMU, because each address space has a separate MemoryListener
> that only knows about one address space. There is one log_sync callback
> per address space and no code has to know about all address spaces.

Ah, I would have thought this was done per slot, given that there was
no concept of an address space in KVM before this patch :)

> > The main drawback is that forcing decoupling on existing IOCTLs would be
> > really awkward ... we'd need to have new API for address spaces;
> > there are two basic operations on an address space:
> > add and remove slot (needs: slot id, address space id)
> > which would give roughly the same functionality as this patch.
> > (To maximize compatibility, there could be a default address space and a
> > slot flag that doesn't automatical[...]
> > On top of this, we could allow hierarchical address spaces, so very similar
> > address spaces (like SMM) would be easier to set up.
>
> That would actually be really messy. :)
>
> The regular and SMM address spaces are not hierarchical. As soon as you
> put a PCI resource underneath SMRAM---which is exactly what happens for
> legacy VRAM at 0xa0000---they can be completely different. Note that
> QEMU can map legacy VRAM as a KVM memslot when using the VGA 320x200x256
> color mode (this mapping is not correct from the VGA point of view, but
> it cannot be changed in QEMU without breaking migration).

How is a PCI resource under SMRAM accessed?
I thought that outside SMM, a PCI resource under SMRAM works normally,
but is shadowed, and made inaccessible, in SMM.

I'm not sure we mean the same hierarchy. I meant hierarchy in the
sense that one address space is considered before the other.
(Maybe "layers" would be a better word.)
The SMM address space could have just one slot and sit above the
regular one; we'd then decide how to handle overlaps.
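
A rough sketch of the lookup I have in mind (layered_gfn_to_memslot and
top_layer are made up; __kvm_memslots() is borrowed from this series
just for illustration):

static struct kvm_memory_slot *
layered_gfn_to_memslot(struct kvm *kvm, int top_layer, gfn_t gfn)
{
        int as;

        /* Walk the layers from the top down; the first address space
         * with a slot containing gfn wins.  For SMM, the top layer
         * could hold a single SMRAM slot over the regular space. */
        for (as = top_layer; as >= 0; as--) {
                struct kvm_memory_slot *slot =
                        __gfn_to_memslot(__kvm_memslots(kvm, as), gfn);

                if (slot)
                        return slot;
        }
        return NULL;
}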

> What I do dislike in the API is the 16-bit split of the slots field; but
> the alternative of defining KVM_SET_USER_MEMORY_REGION2 and
> KVM_GET_DIRTY_LOG2 ioctls is just as sad. If you prefer it that way, I
> can certainly do that.

No, what we have now is much better.
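
(The split as I understand it from the patch, roughly:

        u16 as_id = mem->slot >> 16;    /* high half: address space */
        u16 id = (u16)mem->slot;        /* low half: slot id */

        if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
                return -EINVAL;

so existing userspace, which never set the high bits, keeps working
with address space 0.)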

I'd like to know why the route taken is better than the one I'd prefer,
which probably makes me look like I have a problem with everything, but
your solution is neat.

> Is it okay for you if I separate the cleanup patches, and then repost
> the actual SMM series after the API discussions have settled? The
> cleanups should be easily reviewable, and they make some sense on their
> own. I'm currently autotesting them.

Yes, it would be much cleaner. (I wasn't able to apply these patches,
so I just skimmed them; a proper review will take time.)

Thanks.