Re: [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map

From: Paolo Bonzini
Date: Thu Apr 01 2021 - 13:45:47 EST


On 01/04/21 18:50, Ben Gardon wrote:
>> retry:
>>         if (is_shadow_present_pte(iter.old_spte)) {
>>                 if (is_large_pte(iter.old_spte)) {
>>                         if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
>>                                 break;
>>
>>                         /*
>>                          * The iter must explicitly re-read the SPTE because
>>                          * the atomic cmpxchg failed.
>>                          */
>>                         iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
>>                         goto retry;
>>                 }
>>         } else {
>>                 ...
>>         }
>>
>> ?
> To be honest, that feels less readable to me. For me retry implies
> that we failed to make progress and need to repeat an operation, but
> the reality is that we did make progress and there are just multiple
> steps to replace the large SPTE with a child PT.

You're right, it makes no sense: I misremembered the direction of
tdp_mmu_zap_spte_atomic's return value.
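For reference, the contract I got backwards is roughly the following
(a simplified sketch from memory, not the exact body; the real function
freezes the SPTE with the special REMOVED_SPTE value and flushes remote
TLBs before clearing it for good):

static bool tdp_mmu_zap_spte_atomic(struct kvm *kvm, struct tdp_iter *iter)
{
        /* Freeze the SPTE so nobody can install a present entry yet. */
        if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE))
                return false;   /* lost the race; caller must re-read */

        kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
                                           KVM_PAGES_PER_HPAGE(iter->level));

        /* TLBs are flushed, so the SPTE can be cleared for good. */
        WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
        return true;
}

What I was actually thinking of is this suggestion of yours: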

> Another option which could improve readability and performance would
> be to use the retry to repeat failed cmpxchgs instead of breaking out
> of the loop. Then we could avoid retrying the page fault each time a
> cmpxchg failed, which may happen a lot as vCPUs allocate intermediate
> page tables on boot. (Probably less common for leaf entries, but
> possibly useful there too.)

which would be

retry:
        if (is_shadow_present_pte(iter.old_spte)) {
                if (is_large_pte(iter.old_spte)) {
                        if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter)) {
                                /*
                                 * The iter must explicitly re-read the SPTE
                                 * because the atomic cmpxchg failed.
                                 */
                                iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
                                goto retry;
                        }
                        /*
                         * The large SPTE is gone; treat it as non-present
                         * so the allocation below runs.
                         * XXX move this to tdp_mmu_zap_spte_atomic?
                         */
                        iter.old_spte = 0;
                } else {
                        /* A child page table is already present. */
                        continue;
                }
        }

        sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
        child_pt = sp->spt;

        new_spte = make_nonleaf_spte(child_pt,
                                     !shadow_accessed_mask);

        if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter,
                                     new_spte)) {
                tdp_mmu_free_sp(sp);
                /*
                 * The iter must explicitly re-read the SPTE because
                 * the atomic cmpxchg failed.
                 */
                iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
                goto retry;
        }
        tdp_mmu_link_page(vcpu->kvm, sp, true,
                          huge_page_disallowed &&
                          req_level >= iter.level);

        trace_kvm_mmu_get_page(sp, true);

which survives at least a quick smoke test of booting a 20-vCPU Windows
guest. If you agree I'll turn this into an actual patch.
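
For completeness, the primitive everything above leans on is the
cmpxchg in tdp_mmu_set_spte_atomic. Stripped to its core (a rough
sketch; the real function also does handle_changed_spte() bookkeeping
after a successful exchange), it is:

static bool tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter,
                                    u64 new_spte)
{
        lockdep_assert_held_read(&kvm->mmu_lock);

        /*
         * Install new_spte only if the SPTE still holds the value the
         * iterator last read, so a concurrent writer makes us fail
         * instead of being silently overwritten.
         */
        if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
                      new_spte) != iter->old_spte)
                return false;

        /* handle_changed_spte() bookkeeping elided */
        return true;
}

That is why every failure path above must re-read the SPTE before
jumping back to retry: iter.old_spte is the expected value for the
next cmpxchg.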