Re: [PATCH RFC 00/15] KVM: Dirty ring interface

From: Peter Xu
Date: Thu Dec 05 2019 - 15:52:24 EST


On Thu, Dec 05, 2019 at 08:59:33PM +0100, Paolo Bonzini wrote:
> On 05/12/19 20:30, Peter Xu wrote:
> >> Try enabling kvmmmu tracepoints too, it will tell
> >> you more of the path that was taken while processing the EPT violation.
> >
> > These new tracepoints are extremely useful (which I didn't notice
> > before).
>
> Yes, they are!

(I forgot to say thanks for teaching me that! :)

>
> > So here's the final culprit...
> >
> > void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> > {
> > ...
> > spin_lock(&kvm->mmu_lock);
> > /* FIXME: we should use a single AND operation, but there is no
> > * applicable atomic API.
> > */
> > while (mask) {
> > clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> > mask &= mask - 1;
> > }
> >
> > kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> > spin_unlock(&kvm->mmu_lock);
> > }
> >
> > The mask is cleared before reaching
> > kvm_arch_mmu_enable_log_dirty_pt_masked()..
>
> I'm not sure why that results in two vmexits? (clearing before
> kvm_arch_mmu_enable_log_dirty_pt_masked is also what
> KVM_{GET,CLEAR}_DIRTY_LOG does).

Sorry my fault to be not clear on this.

The kvm_arch_mmu_enable_log_dirty_pt_masked() only explains why the
same page is not written again after the ring-full userspace exit
(which triggered the real dirty bit missing), and that's because the
write bit is not removed during KVM_RESET_DIRTY_RINGS so the next
vmenter will directly write to the previous page without vmexit.

The two vmexits is another story - I tracked it is retried because
mmu_notifier_seq has changed, hence it goes through this path:

if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
goto out_unlock;

It's because when try_async_pf(), we will do a writable page fault,
which probably triggers both the invalidate_range_end and change_pte
notifiers. A reference trace when EPT enabled:

kvm_mmu_notifier_change_pte+1
__mmu_notifier_change_pte+82
wp_page_copy+1907
do_wp_page+478
__handle_mm_fault+3395
handle_mm_fault+196
__get_user_pages+681
get_user_pages_unlocked+172
__gfn_to_pfn_memslot+290
try_async_pf+141
tdp_page_fault+326
kvm_mmu_page_fault+115
kvm_arch_vcpu_ioctl_run+2675
kvm_vcpu_ioctl+536
do_vfs_ioctl+1029
ksys_ioctl+94
__x64_sys_ioctl+22
do_syscall_64+91

I'm not sure whether that's ideal, but it makes sense to me.

>
> > The funny thing is that I did have a few more patches to even skip
> > allocate the dirty_bitmap when dirty ring is enabled (hence in that
> > tree I removed this while loop too, so that has no such problem).
> > However I dropped those patches when I posted the RFC because I don't
> > think it's mature, and the selftest didn't complain about that
> > either.. Though, I do plan to redo that in v2 if you don't disagree.
> > The major question would be whether the dirty_bitmap could still be
> > for any use if dirty ring is enabled.
>
> Userspace may want a dirty bitmap in addition to a list (for example:
> list for migration, bitmap for framebuffer update), but it can also do a
> pass over the dirty rings in order to update an internal bitmap.
>
> So I think it make sense to make it either one or the other.

Ok, then I'll do.

Thanks,

--
Peter Xu