Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration

From: Sean Christopherson
Date: Tue Aug 08 2023 - 13:11:32 EST


On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > !is_last_spte(iter.old_spte, iter.level))
> > continue;
> >
> > + if (skip_pinned) {
> > + kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > + struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > + struct folio *folio;
> > +
> > + if (!page)
> > + continue;
> > +
> > + folio = page_folio(page);
> > +
> > + if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > + folio_maybe_dma_pinned(folio))
> > + continue;
> > + }
> > +
>
> I don't get it..
>
> The last patch made it so that the NUMA balancing code doesn't change
> page_maybe_dma_pinned() pages to PROT_NONE
>
> So why doesn't KVM just check if the current and new SPTE are the same
> and refrain from invalidating if nothing changed?

Because KVM doesn't have visibility into the current and new PTEs when the zapping
occurs. The contract for invalidate_range_start() requires that KVM drop all
references before returning, and so the zapping occurs before change_pte_range()
or change_huge_pmd() have done antyhing.

> Duplicating the checks here seems very frail to me.

Yes, this is approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned()
can yield different results purely based on refcounts, i.e. KVM could skip pages
that the primary MMU does not, and thus violate the mmu_notifier contract. And
in general, I am steadfastedly against adding any kind of heuristic to KVM's
zapping logic.

This really needs to be fixed in the primary MMU and not require any direct
involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
to be skipped.