Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before link it

From: Sean Christopherson
Date: Fri Sep 03 2021 - 12:06:37 EST


On Fri, Sep 03, 2021, Lai Jiangshan wrote:
>
> On 2021/9/3 07:54, Sean Christopherson wrote:
> >
> > trace_get_page:
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 50ade6450ace..5b13918a55c2 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -704,6 +704,10 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  			access = gw->pt_access[it.level - 2];
> >  			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
> >  					      it.level-1, false, access);
> > +			if (sp->unsync_children) {
> > +				kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);
> > +				return RET_PF_RETRY;
>
> Making it possible to request KVM_REQ_MMU_SYNC remotely is a good idea.
> But if the sp is not linked, the @sp might not be synced even if we
> try many times. So we should continue to link it.

Hrm, yeah. The sp has to be linked in at least one mmu, but it may not be linked
in the current mmu, so KVM would have to sync all roots across all current and
previous mmus in order to guarantee that a root the target page is linked in
actually gets synced. Eww.

> But if we continue to link it, KVM_REQ_MMU_SYNC should be extended to
> sync all roots (current root and prev_roots). And maybe add a
> KVM_REQ_MMU_SYNC_CURRENT for current root syncing.
>
> It is not going to be simple. I have a new way to sync pages
> that also fixes the problem, but it includes several non-fix patches.
>
> We need to fix this problem in the simplest way. In my patch
> mmu_sync_children() has a @root argument. I think we can disallow
> releasing the lock when @root is false. Is it OK?

With a caveat, it should work. I was exploring that option before the remote
sync idea.

The danger is inducing a stall in the host (RCU, etc...) if sp is an upper-level
entry, e.g. with 5-level paging it can even be a PML4. My thought for that is to
skip the yield if there are fewer than N unsync children remaining, and otherwise
bail out if a yield is needed but the caller doesn't allow yielding. If
mmu_sync_children() fails, resume the guest and let it go through the entire page
fault path again. Worst case scenario, it will take a "few" rounds for the vCPU to
finally resolve the page fault.
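
To make the control flow concrete outside of KVM, here is a minimal user-space
sketch of the idea. Everything in it is hypothetical and exists only for
illustration: toy_sp, toy_sync_children(), toy_fault_rounds() and
YIELD_THRESHOLD are not kernel symbols, the threshold value is arbitrary, and the
model syncs one child per iteration whereas the real code syncs a batch of pages
per walk. The resched_pending flag stands in for
need_resched() || rwlock_needbreak().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for SOME_ARBITRARY_THRESHOLD; the value is arbitrary. */
#define YIELD_THRESHOLD 4

/* Toy shadow page: only tracks how many children are still unsync. */
struct toy_sp {
	int unsync_children;
};

/*
 * Toy model of the proposed mmu_sync_children().  Returns 0 once every child
 * is synced, or -EINTR if a yield was needed but the caller disallowed it.
 */
static int toy_sync_children(struct toy_sp *parent, bool can_yield,
			     bool resched_pending)
{
	assert(parent);

	while (parent->unsync_children > 0) {
		/* "Sync" one child; the real code handles a batch here. */
		parent->unsync_children--;

		/* Only consider yielding while plenty of work remains. */
		if (parent->unsync_children > YIELD_THRESHOLD &&
		    resched_pending) {
			if (!can_yield)
				return -EINTR;

			/* Real code would flush, drop and retake mmu_lock. */
			resched_pending = false;
		}
	}
	return 0;
}

/*
 * Toy page fault path: the caller may not yield, so it retries the whole
 * "fault" until the sync completes.  Returns the number of rounds taken.
 */
static int toy_fault_rounds(struct toy_sp *sp, bool resched_pending)
{
	int rounds = 1;

	while (toy_sync_children(sp, false, resched_pending))
		rounds++;
	return rounds;
}
```

In this model, with ten unsync children and a reschedule always pending, a caller
that passes can_yield=false needs six rounds: five bail out with -EINTR after
syncing one child each, and the sixth runs to completion because the remaining
count has dropped to the threshold. Each failed round still makes forward
progress, which is why the retry loop terminates.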

Regarding params, please use "can_yield" instead of "root" to match similar logic
in the TDP MMU, and return an int instead of a bool.

Thanks!

---
arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++------
arch/x86/kvm/mmu/paging_tmpl.h | 3 +++
2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4853c033e6ce..5be990cdb2be 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2024,8 +2024,8 @@ static void mmu_pages_clear_parents(struct mmu_page_path *parents)
 	} while (!sp->unsync_children);
 }

-static void mmu_sync_children(struct kvm_vcpu *vcpu,
-			      struct kvm_mmu_page *parent)
+static int mmu_sync_children(struct kvm_vcpu *vcpu,
+			     struct kvm_mmu_page *parent, bool can_yield)
 {
 	int i;
 	struct kvm_mmu_page *sp;
@@ -2050,7 +2050,15 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 			flush |= kvm_sync_page(vcpu, sp, &invalid_list);
 			mmu_pages_clear_parents(&parents);
 		}
-		if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
+		/*
+		 * Don't yield if there are fewer than <N> unsync children
+		 * remaining, just finish up and get out.
+		 */
+		if (parent->unsync_children > SOME_ARBITRARY_THRESHOLD &&
+		    (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock))) {
+			if (!can_yield)
+				return -EINTR;
+
 			kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
 			cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
 			flush = false;
@@ -2058,6 +2066,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 	}

 	kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
+	return 0;
 }

static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
@@ -2143,9 +2152,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 		}

-		if (sp->unsync_children)
-			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
-
 		__clear_sp_write_flooding_count(sp);

trace_get_page:
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 50ade6450ace..2ff123ec0d64 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -704,6 +704,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 			access = gw->pt_access[it.level - 2];
 			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
 					      it.level-1, false, access);
+			if (sp->unsync_children &&
+			    mmu_sync_children(vcpu, sp, false))
+				return RET_PF_RETRY;
 		}

 		/*
--