RE: [PATCH v8 00/14] KVM: Dirty ring interface

From: Tian, Kevin
Date: Sun Apr 26 2020 - 06:30:01 EST


> From: Peter Xu
> Sent: Friday, April 24, 2020 10:20 PM
>
> On Fri, Apr 24, 2020 at 06:01:46AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu <peterx@xxxxxxxxxx>
> > > Sent: Thursday, April 23, 2020 11:23 PM
> > >
> > > On Thu, Apr 23, 2020 at 06:28:43AM +0000, Tian, Kevin wrote:
> > > > > From: Peter Xu <peterx@xxxxxxxxxx>
> > > > > Sent: Thursday, April 23, 2020 2:52 AM
> > > > >
> > > > > Hi,
> > > > >
> > > > > TL;DR: I'm thinking whether we should record pure GPA/GFN instead of
> > > > > (slot_id, slot_offset) tuple for dirty pages in kvm dirty ring to
> > > > > unbind kvm_dirty_gfn with memslots.
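For context, the two layouts being compared would look roughly like the below
(just a sketch from my reading, not the exact structures in the v8 series):

    /* today: a dirty page is identified relative to a memslot */
    struct kvm_dirty_gfn {
            __u32 flags;
            __u32 slot;     /* address space id + slot id */
            __u64 offset;   /* page offset within that memslot */
    };

    /* the alternative: identify the page by guest frame number only */
    struct kvm_dirty_gfn {
            __u32 flags;
            __u32 pad;
            __u64 gfn;      /* guest page frame, memslot-agnostic */
    };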
> > > > >
> > > > > (A slightly longer version starts...)
> > > > >
> > > > > The problem is that binding dirty tracking operations to KVM memslots
> > > > > is a restriction that needs synchronization with memslot changes, which
> > > > > further needs synchronization across all the vcpus because they're the
> > > > > consumers of memslots. E.g., when we remove a memory slot, we need to
> > > > > flush all the dirty bits correctly before we do the removal of the
> > > > > memslot. That's actually a known defect for QEMU/KVM [1] (I bet it
> > > > > could be a defect for many other hypervisors...) right now with the
> > > > > current dirty logging. Meanwhile, even if we fix it, that procedure
> > > > > does not scale at all and is prone to deadlocks.
> > > > >
> > > > > Here memory removal is really a (still corner-cased but relatively)
> > > > > important scenario to think about for dirty logging, compared to
> > > > > memory additions & moves, because a memory addition will always start
> > > > > with no dirty pages, and we don't really move RAM a lot (or do we
> > > > > ever?!) for a general VM use case.
> > > > >
> > > > > Then I took a step back to think about why we need this dirty bit
> > > > > information at all if the memslot is going to be removed.
> > > > >
> > > > > There're two cases:
> > > > >
> > > > > - When the memslot is going to be removed forever, then the dirty
> > > > > information
> > > > > is indeed meaningless and can be dropped, and,
> > > > >
> > > > > - When the memslot is going to be removed but quickly added back with
> > > > > a changed size, then we need to keep those dirty bits, because it's
> > > > > just a common way to e.g. punch an MMIO hole in an existing RAM region
> > > > > (here I'd confess I feel like using "slot_id" to identify a memslot is
> > > > > a really unfriendly syscall design for things like "hole punching" in
> > > > > the RAM address space... However, such a "punch hole" operation is
> > > > > really needed even for a common guest, for either system reboots or
> > > > > device hotplugs, etc.).
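To make the "hole punch" point above concrete: with the current uAPI it is a
delete-then-add-back dance on the memslot, roughly like the sketch below (the
slot numbers, addresses and helper name are all made up for illustration):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Carve an MMIO hole [hole_gpa, hole_gpa + hole_size) out of a RAM slot
     * covering [ram_gpa, ram_gpa + ram_size), backed by host memory at hva. */
    static void punch_hole(int vm_fd, __u64 ram_gpa, __u64 ram_size, void *hva,
                           __u64 hole_gpa, __u64 hole_size)
    {
            struct kvm_userspace_memory_region r = {
                    .slot = 1,
                    .guest_phys_addr = ram_gpa,
                    .memory_size = 0,  /* size 0 deletes the slot (dirty bits gone) */
                    .userspace_addr = (__u64)(unsigned long)hva,
            };
            ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);

            /* add back the RAM below the hole... */
            r.memory_size = hole_gpa - ram_gpa;
            ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);

            /* ...and the RAM above it, as a second slot */
            r.slot = 2;
            r.guest_phys_addr = hole_gpa + hole_size;
            r.memory_size = ram_gpa + ram_size - r.guest_phys_addr;
            r.userspace_addr += hole_gpa + hole_size - ram_gpa;
            ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);
    }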
> > > >
> > > > why would device hotplug punch a hole in an existing RAM region?
> > >
> > > I thought it could happen because I used to trace the KVM ioctls and see
> > > the memslot changes during driver loading. But later when I tried to
> > > hotplug a
> >
> > Is there more detail on why driver loading may lead to memslot changes?
>
> E.g., I can observe these after Linux loads and before the prompt, which is a
> simplest VM with default devices on:
>
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x0 gpa=0xfd000000 size=0x0 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#3 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41874@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#65539 flags=0x1 gpa=0xfd000000 size=0x1000000 ua=0x7fadf6800000 ret=0
> 41875@xxxxxxxxxxxxxxxxx:kvm_set_user_memory Slot#9 flags=0x1 gpa=0xa0000 size=0x10000 ua=0x7fadf6800000 ret=0
>
> After a careful look, I noticed it's mostly just the VGA device turning slot 3
> off & on. Frankly speaking, I don't know why it does that.
>
> >
> > > device I do see that it won't... The new MMIO regions are added only into
> > > 0xfe000000 for a virtio-net:
> > >
> > > 00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
> > > 00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
> > > 00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
> > > 00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
> > > 00000000fe840000-00000000fe84002f (prio 0, i/o): msix-table
> > > 00000000fe840800-00000000fe840807 (prio 0, i/o): msix-pba
> > >
> > > Does it mean that device plugging is guaranteed to not trigger RAM
> > > changes?
> >
> > I'd think so. Otherwise, from the guest's p.o.v., any device hotplug would
> > imply doing a memory hot-unplug first, and that would be a bad design.
>
> Right that's what I was confused about. Then maybe you're right. :)
>
> >
> > > I am really curious about what cases we need to consider in which we need
> > > to keep the dirty bits for a memory removal, and if system reset is the
> > > only case, then it could be even easier (because we might be able to avoid
> > > the sync in memory removal but do that once in a sys reset hook)...
> >
> > Possibly memory hot-unplug, as allowed by recent virtio-mem?
>
> That should belong to the case where dirty bits do not matter at all after the
> removal, right? I would be mostly curious about when we (1) remove a memory
> slot, and at the same time (2) still care about the dirty bits of that slot.

I remember one case. The PCIe spec defines a resizable BAR capability, allowing
hardware to communicate supported resource sizes and software to set an optimal
size back to the hardware. If such a capability is presented to the guest and
the BAR is backed by memory, it's possible to observe a removal-and-add-back
scenario. However, in that case the spec requires software to clear the memory
space enable bit in the command register before changing the BAR size, and I
suppose such a resize should happen in the boot phase, where dirty bits related
to the old BAR size don't matter.
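For reference, on the guest side that step looks roughly like the below with
the standard Linux config-space accessors (a sketch only; "pdev" is the device
in question, and programming the new size through the resizable BAR control
register is omitted):

    #include <linux/pci.h>

    u16 cmd;

    /* disable memory decoding before touching the BAR size */
    pci_read_config_word(pdev, PCI_COMMAND, &cmd);
    pci_write_config_word(pdev, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);

    /* ... program the new size via the resizable BAR control register ... */

    /* re-enable memory decoding once the BAR has been resized */
    pci_write_config_word(pdev, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);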

>
> I'll see whether I can remove the dirty bit sync in kvm_set_phys_mem(), which
> I think is really nasty.
>
> >
> > btw VFIO faces a similar problem when unmapping a DMA range (e.g. when
> > vIOMMU is enabled) in the dirty log phase. There could be some dirty bits
> > which are not retrieved when the unmapping happens. VFIO chooses to return
> > the dirty bits in a buffer passed in through the unmapping parameters. Can
> > the memslot interface do a similar thing, by allowing the userspace to
> > specify a buffer pointer to hold whatever dirty pages are recorded for the
> > slot that is being removed?
>
> Yes I think we can, but it may not be necessary. Actually IMHO CPU accesses to
> pages are slightly different from device DMAs, in that we can do this sequence
> to collect the dirty bits of a slot safely:
>
> - mark the slot as READONLY
> - KVM_GET_DIRTY_LOG on the slot
> - finally remove the slot
>
> I guess VFIO cannot do that because there's no way to really "mark the region
> as read-only" for a device: DMA could still happen, and it would then fail
> when writing to a read-only slot.
>
> On the KVM/CPU side, after we mark the slot as READONLY, CPU writes will page
> fault and fall back to QEMU userspace, and then QEMU will take care of the
> writes (so those writes could be slow but still work). So even though we mark
> it READONLY, it won't fail the writes; they just fall back to QEMU.

Yes, you are right.
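For the archives, that sequence from userspace would look roughly like the
below (a sketch only; error handling and the dirty bitmap allocation/sizing
are elided, and vm_fd and the slot geometry are placeholders):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    struct kvm_userspace_memory_region r = {
            .slot = slot_id,
            .flags = KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY,
            .guest_phys_addr = gpa,
            .memory_size = size,
            .userspace_addr = hva,
    };

    /* 1) flip the slot to read-only: guest writes now exit to userspace */
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);

    /* 2) harvest the remaining dirty bits of the slot */
    struct kvm_dirty_log log = { .slot = slot_id, .dirty_bitmap = bitmap };
    ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);

    /* 3) finally delete the slot (memory_size == 0 means removal) */
    r.memory_size = 0;
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);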

>
> Btw, since we're discussing VFIO dirty logging across memory removal... the
> unmapped DMA range you're talking about needs to be added back later, right?

A pure memory removal doesn't need an add-back. Or are you specifically
referring to the removal-and-add-back case?

> Then what if the device does DMA after the removal but before it's added
> back?

Then it just gets an IOMMU page fault. But I guess I didn't get your real
question...

> I think this is a more general question, not only for dirty logging but also
> for when dirty logging is not enabled - I never understood how this could be
> fixed with existing facilities.
>

Thanks,
Kevin