Re: [PATCH v3 00/21] KVM: Dirty ring interface

From: Peter Xu
Date: Thu Jan 09 2020 - 14:39:56 EST


On Thu, Jan 09, 2020 at 02:08:52PM -0500, Michael S. Tsirkin wrote:
> On Thu, Jan 09, 2020 at 12:08:49PM -0500, Peter Xu wrote:
> > On Thu, Jan 09, 2020 at 11:40:23AM -0500, Michael S. Tsirkin wrote:
> >
> > [...]
> >
> > > > > I know it's mostly relevant for huge VMs, but OTOH these
> > > > > probably use huge pages.
> > > >
> > > > Yes huge VMs could benefit more, especially if the dirty rate is not
> > > > that high, I believe. Though, could you elaborate on why huge pages
> > > > are special here?
> > > >
> > > > Thanks,
> > >
> > > With hugetlbfs there are fewer bits to test: e.g. with 2M pages a single
> > > bit set marks 512 pages as dirty. We do not take advantage of this
> > > but it looks like a rather obvious optimization.
> >
> > Right, but isn't that the trade-off between granularity of dirty
> > tracking and how easy it is to collect the dirty bits? Say, it'll be
> > nearly impossible to migrate 1G-huge-page-backed guests if we track
> > dirty bits using huge page granularity, since each touch of guest
> > memory will cause another 1G memory to be transferred even if most of
> > the content is the same. 2M can be somewhere in the middle, but still
> > the same write amplification issue exists.
> >
>
> OK I see I'm unclear.
>
> IIUC at the moment KVM never uses huge pages if any part of the huge page is
> tracked.

To be more precise, I think it's per-memslot: if a memslot is dirty
tracked, then the host maps no huge pages for that memslot (even if
the guest uses huge pages over that range).

> But if all parts of the page are written to then huge page
> is used.

I'm not sure about this... I think it's still tracked at 4K granularity.

>
> In this situation the whole huge page is dirty and needs to be migrated.

Note that in QEMU we always migrate pages in 4K for x86, iiuc (please
refer to ram_save_host_page() in QEMU).
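
To put rough numbers on the granularity trade-off, here is a minimal
sketch in C. The helper names (pages_per_dirty_bit, bytes_to_resend)
are made up for illustration and exist in neither KVM nor QEMU; the
sketch only assumes 4K base pages and a worst case where every guest
write lands in a different tracking granule.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (UINT64_C(1) << 12)
#define SZ_2M (UINT64_C(1) << 21)
#define SZ_1G (UINT64_C(1) << 30)

/*
 * Hypothetical helper: how many 4K pages one dirty bit stands for
 * when dirty tracking is done at the given granule size.
 */
static inline uint64_t pages_per_dirty_bit(uint64_t granule)
{
	/* The granule must be a whole number of 4K pages. */
	assert(granule >= SZ_4K && (granule % SZ_4K) == 0);
	return granule / SZ_4K;
}

/*
 * Hypothetical helper: bytes that must be re-sent after `writes`
 * guest stores, assuming (worst case) each store touches a different
 * granule and the whole granule is re-sent.
 */
static inline uint64_t bytes_to_resend(uint64_t writes, uint64_t granule)
{
	return writes * granule;
}
```

With these, pages_per_dirty_bit(SZ_2M) is 512 and
pages_per_dirty_bit(SZ_1G) is 262144, so a single small write costs
4K, 2M or 1G of traffic depending on the tracking granule.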

>
> > PS. that seems to be another topic after all besides the dirty ring
> > series because we need to change our policy first if we want to track
> > it with huge pages; with that, for dirty ring we can start to leverage
> > the kvm_dirty_gfn.pad to store the page size with another new kvm cap
> > when we really want.
> >
> > Thanks,
>
> Seems like leaking implementation detail to UAPI to me.

I'd say it's at least not the only place where we make such an
assumption (please also refer to uffd_msg.pagefault.address). IMHO
that's not wrong in itself, because interfaces can be extended, but I
am open to extending kvm_dirty_gfn to carry a length/size, or to making
the pad larger (as long as Paolo is fine with this).
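
Just to illustrate what reusing the pad could look like, a rough
sketch follows. Everything in it is an assumption: demo_dirty_gfn is
a stand-in struct (the layout of a 32-bit pad, a 32-bit slot and a
64-bit offset is assumed, not copied from the series), and storing the
page order in the pad is only one possible encoding. Nothing here is a
proposed UAPI.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumes 4K base pages */

/* Hypothetical stand-in for a dirty ring entry; layout is assumed. */
struct demo_dirty_gfn {
	uint32_t pad;		/* here: page order, 0 means one 4K page */
	uint32_t slot;		/* memslot identifier */
	uint64_t offset;	/* page offset within the memslot */
};

/* Record that the entry covers 2^order base pages. */
static inline void demo_set_order(struct demo_dirty_gfn *e, uint32_t order)
{
	e->pad = order;
}

/* Number of bytes the entry marks as dirty. */
static inline uint64_t demo_dirty_bytes(const struct demo_dirty_gfn *e)
{
	/* Keep the shift below within 64 bits. */
	assert(e->pad < 64 - DEMO_PAGE_SHIFT);
	return UINT64_C(1) << (DEMO_PAGE_SHIFT + e->pad);
}
```

With this encoding a zeroed pad still means a single 4K page, so
entries written without knowledge of the new meaning keep their
current interpretation; order 9 would describe a 2M page and order 18
a 1G page.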

Thanks,

--
Peter Xu