Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration

From: Yan Zhao
Date: Wed Aug 16 2023 - 01:42:49 EST


On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
>
>
> 在 2023/8/16 10:43, bibo mao 写道:
> >
> >
> > 在 2023/8/15 22:50, Sean Christopherson 写道:
> >> On Tue, Aug 15, 2023, Yan Zhao wrote:
> >>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> >>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> >>>>>>
> >>>>>> Compile tested only.
> >>>>>
> >>>>> I don't find a matching end to each
> >>>>> mmu_notifier_invalidate_range_start_nonblock().
> >>>>
> >>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> >>>>
> >>>> if (range.start)
> >>>> mmu_notifier_invalidate_range_end(&range);
> >>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
> >>> if we only want the range to include pages successfully set to PROT_NONE.
> >>
> >> Precise invalidation was a non-goal for my hack-a-patch. The intent was purely
> >> to defer invalidation until it was actually needed, but still perform only a
> >> single notification so as to batch the TLB flushes, e.g. the start() call still
> >> used the original @end.
> >>
> >> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
> >> It was complete untested though, so it may not have actually done anything to reduce
> >> the number of pointless invalidations.
> > For numa-balance scenery, can original page still be used by application even if pte
> > is changed with PROT_NONE? If it can be used, maybe we can zap shadow mmu and flush tlb
For GUPs that does not honor FOLL_HONOR_NUMA_FAULT, yes,

See https://lore.kernel.org/all/20230803143208.383663-1-david@xxxxxxxxxx/

> Since there is kvm_mmu_notifier_change_pte notification when numa page is replaced with
> new page, my meaning that can original page still be used by application even if pte
> is changed with PROT_NONE and before replaced with new page?
It's not .change_pte() notification, which is sent when COW.
The do_numa_page()/do_huge_pmd_numa_page() will try to unmap old page
protected with PROT_NONE, and if every check passes, a separate
.invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
sent.

So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT), can safely
keep mapping maybe-dma pages until MMU_NOTIFY_CLEAR is sent.
(this approach is implemented in RFC v1
https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@xxxxxxxxx/)

>
> And for primary mmu, tlb is flushed after pte is changed with PROT_NONE and
> after mmu_notifier_invalidate_range_end notification for secondary mmu.
> Regards
> Bibo Mao

> >> in notification mmu_notifier_invalidate_range_end with precised range, the range can
But I don't think flush tlb only in the .invalidate_range_end() in
secondary MMU is a good idea.
Flush must be done before kvm->mmu_lock is unlocked, otherwise,
confusion will be caused when multiple threads trying to update the
secondary MMU.

> >> be cross-range between range mmu_gather and mmu_notifier_range.