Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration

From: Jason Gunthorpe
Date: Tue Aug 08 2023 - 11:55:42 EST


On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > >  		    !is_last_spte(iter.old_spte, iter.level))
> > >  			continue;
> > >
> > > +		if (skip_pinned) {
> > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > +			struct folio *folio;
> > > +
> > > +			if (!page)
> > > +				continue;
> > > +
> > > +			folio = page_folio(page);
> > > +
> > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > +			    folio_maybe_dma_pinned(folio))
> > > +				continue;
> > > +		}
> > > +
> >
> > I don't get it..
> >
> > The last patch made it so that the NUMA balancing code doesn't change
> > page_maybe_dma_pinned() pages to PROT_NONE.
> >
> > So why doesn't KVM just check if the current and new SPTE are the same
> > and refrain from invalidating if nothing changed?
>
> Because KVM doesn't have visibility into the current and new PTEs when the zapping
> occurs. The contract for invalidate_range_start() requires that KVM drop all
> references before returning, and so the zapping occurs before change_pte_range()
> or change_huge_pmd() have done anything.
>
> > Duplicating the checks here seems very frail to me.
>
> Yes, this approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned()
> can yield different results purely based on refcounts, i.e. KVM could skip pages
> that the primary MMU does not, and thus violate the mmu_notifier contract. And
> in general, I am steadfastly against adding any kind of heuristic to KVM's
> zapping logic.
>
> This really needs to be fixed in the primary MMU and not require any direct
> involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> to be skipped.

This likely has the same issue you just described: we don't know if it
can be skipped until we iterate over the PTEs, and by then it is too
late to invoke the notifier. Maybe some kind of abort and restart
scheme could work?
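
To make the abort-and-restart idea concrete, here is a rough userspace
model of it. This is not kernel code and not a proposed patch: every name
in it (model_pte, model_change_range, model_invalidate_range_start,
notifier_calls) is made up for illustration, and it ignores locking and
concurrent PTE changes entirely, both of which a real implementation
would have to handle. The walk starts without calling the notifier; on
the first PTE that would actually be modified it aborts, invokes the
notifier once, and restarts from the beginning of the range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_pte {
	bool present;
	bool maybe_pinned;	/* stand-in for a "maybe DMA-pinned" check */
	bool prot_none;
};

/* Counts simulated invalidate_range_start() calls. */
static int notifier_calls;

static void model_invalidate_range_start(void)
{
	notifier_calls++;
}

/* True if the walk would actually modify this PTE. */
static bool model_pte_would_change(const struct model_pte *pte)
{
	return pte->present && !pte->maybe_pinned && !pte->prot_none;
}

/*
 * Walk the range and mark eligible PTEs as prot_none.  The notifier is
 * invoked lazily: the first PTE that needs a change aborts the walk,
 * triggers the notifier, and restarts the walk from index 0.  Nothing is
 * modified before the notifier has run.  Returns the number of PTEs
 * changed.
 */
size_t model_change_range(struct model_pte *ptes, size_t n)
{
	bool notified = false;
	size_t changed = 0;
	size_t i;

	assert(ptes != NULL || n == 0);
restart:
	for (i = 0; i < n; i++) {
		if (!model_pte_would_change(&ptes[i]))
			continue;
		if (!notified) {
			/* Abort before touching anything, notify, restart. */
			model_invalidate_range_start();
			notified = true;
			goto restart;
		}
		ptes[i].prot_none = true;
		changed++;
	}
	return changed;
}
```

In this model a range made up only of maybe-pinned entries never triggers
the notifier, while a mixed range triggers it exactly once before the
first modification.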

Jason