Re: [RFC PATCH 4/6] KVM: X86: Introduce role.level_promoted

From: Sean Christopherson
Date: Tue Jan 04 2022 - 17:14:26 EST


On Fri, Dec 10, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
>
> Level pormotion occurs when mmu->shadow_root_level > mmu->root_level.

s/pormotion/promotion (in multiple places)

That said, I strongly prefer "passthrough" or maybe "nop" over "level_promoted".
Promoted in the context of page level usually refers to making a hugepage, which
is not the case here.

>
> There are several cases that can cuase level pormotion:
>
> shadow mmu (shadow paging for 32 bit guest):
> case1: gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0
>
> shadow nested NPT (for 32bit L1 hypervisor):
> case2: gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
> case3: gCR0_PG=1,gEFER_LMA=0,hEFER_LMA=1
>
> shadow nested NPT (for 64bit L1 hypervisor):
> case4: gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1
>
> When level pormotion occurs (32bit guest, case1-3), special roots are
> often used. But case4 is not using special roots. It uses shadow page
> without fully aware of the specialty. It might work accidentally:
> 1) The root_page (root_sp->spt) is allocated with level = 5,
> and root_sp->spt[0] is allocated with the same gfn and the
> same role except role.level = 4. Luckly that they are
> different shadow pages.
> 2) FNAME(walk_addr_generic) sets walker->table_gfn[4] and
> walker->pt_access[4], which are normally unused when
> mmu->shadow_root_level == mmu->root_level == 4, so that
> FNAME(fetch) can use them to allocate shadow page for
> root_sp->spt[0] and link them when shadow_root_level == 5.
>
> But it has problems.
> If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse)
> and usees the same gfn as the root for the nNPT before and after
> switching gCR4_LA57. The host (hCR4_LA57=1) wold use the same root_sp
> for guest even guest switches gCR4_LA57. The guest will see unexpected
> page mapped and L2 can hurts L1. It is lucky the the problem can't
> hurt L0.
>
> The root_sp should be like role.direct=1 sometimes: its contents are
> not backed by gptes, root_sp->gfns is meaningless. For a normal high
> level sp, sp->gfns is often unused and kept zero, but it could be
> relevant and meaningful when sp->gfns is used because they are back
> by concret gptes. For level-promoted root_sp described before, root_sp
> is just a portal to contribute root_sp->spt[0], and root_sp should not
> have root_sp->gfns and root_sp->spt[0] should not be dropped if gpte[0]
> of the root gfn is changed.
>
> This patch adds role.level_promoted to address the two problems.
> role.level_promoted is set when shadow paging and
> role.level > gMMU.level.
>
> An alternative way to fix the problem of case4 is that: also using the
> special root pml5_root for it. But it would required to change many
> other places because it is assumption that special roots is only used
> for 32bit guests.
>
> This patch also paves the way to use level promoted shadow page for
> case1-3, but that requires the special handling or PAE paging, so the
> extensive usage of it is not included.
>
> Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
> ---
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/kvm/mmu/mmu.c | 15 +++++++++++++--
> arch/x86/kvm/mmu/paging_tmpl.h | 1 +
> 3 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 88ecf53f0d2b..6465c83794fc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -334,7 +334,8 @@ union kvm_mmu_page_role {
> unsigned smap_andnot_wp:1;
> unsigned ad_disabled:1;
> unsigned guest_mode:1;
> - unsigned :6;
> + unsigned level_promoted:1;
> + unsigned :5;

The massive comment above this union needs to be updated.

> /*
> * This is left at the top of the word so that
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 54e7cbc15380..4769253e9024 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -767,6 +767,9 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
>
> static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
> {
> + if (sp->role.level_promoted)
> + return sp->gfn;
> +
> if (!sp->role.direct)
> return sp->gfns[index];
>
> @@ -776,6 +779,8 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
> static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
> {
> if (!sp->role.direct) {
> + if (WARN_ON_ONCE(sp->role.level_promoted && gfn != sp->gfn))
> + return;
> sp->gfns[index] = gfn;

This is wrong, sp->gfns is NULL when sp->role.level_promoted is true. I believe
you want something similar to the "get" flow, e.g.

if (sp->role.passthrough) {
WARN_ON_ONCE(gfn != sp->gfn);
return;
}

if (!sp->role.direct) {
...
}

Alternatively, should we mark passthrough shadow pages as direct=1? That would
naturally handle this code, and for things like reexecute_instruction()'s usage
of kvm_mmu_unprotect_page(), I don't think passthrough shadow pages should be
considered indirect, e.g. zapping them won't help and the shadow page can't become
unsync.

> return;
> }
> @@ -1702,7 +1707,7 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
> hlist_del(&sp->hash_link);
> list_del(&sp->link);
> free_page((unsigned long)sp->spt);
> - if (!sp->role.direct)
> + if (!sp->role.direct && !sp->role.level_promoted)

Hmm, at this point, it may be better to just omit the check entirely. @sp is
zero allocated and free_page() plays nice with NULL "pointers". I believe TDX
and maybe SNP support will need to do more work here, i.e. may add back a similar
check, but if so then checking sp->gfns directly for !NULL is preferable.

> free_page((unsigned long)sp->gfns);
> kmem_cache_free(mmu_page_header_cache, sp);
> }
> @@ -1740,7 +1745,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> - if (!role.direct)
> + if (!(role.direct || role.level_promoted))

I personally prefer the style of the previous check, mainly because I'm horrible
at reading !(x || y).

if (!role.direct && !role.passthrough)

> sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
> set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>
> @@ -2084,6 +2089,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
> role.quadrant = quadrant;
> }
> + if (role.level_promoted && (level <= vcpu->arch.mmu->root_level))
> + role.level_promoted = 0;
>
> sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> @@ -4836,6 +4843,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
>
> role.base.direct = false;
> role.base.level = kvm_mmu_get_tdp_level(vcpu);
> + if (role.base.level > role_regs_to_root_level(regs))
> + role.base.level_promoted = 1;
>
> return role;
> }
> @@ -5228,6 +5237,8 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
>
> for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
> + if (sp->role.level_promoted)
> + continue;
> if (detect_write_misaligned(sp, gpa, bytes) ||
> detect_write_flooding(sp)) {
> kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 5c78300fc7d9..16ac276d342a 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1043,6 +1043,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> .level = 0xf,
> .access = 0x7,
> .quadrant = 0x3,
> + .level_promoted = 0x1,
> };
>
> /*
> --
> 2.19.1.6.gb485710b
>