Re: [PATCH 3/3] KVM: x86/mmu: don't skip mmu initialization when mmu root level changes

From: Sean Christopherson
Date: Wed Nov 10 2021 - 12:21:21 EST

Next message: Andy Shevchenko: "Re: [PATCH v2 1/2] device property: Add fwnode_irq_get_byname()"
Previous message: H. Nikolaus Schaller: "Re: [PATCH v2 4/6] mmc: core: Fixup storing of OCR for MMC_QUIRK_NONSTD_SDIO"
In reply to: Maxim Levitsky: "Re: [PATCH 3/3] KVM: x86/mmu: don't skip mmu initialization when mmu root level changes"
Next in thread: Maxim Levitsky: "Re: [PATCH 3/3] KVM: x86/mmu: don't skip mmu initialization when mmu root level changes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, Nov 10, 2021, Maxim Levitsky wrote:
> On Wed, 2021-11-10 at 15:48 +0100, Vitaly Kuznetsov wrote:
> > Maxim Levitsky <mlevitsk@xxxxxxxxxx> writes:
> >
> > > When running mix of 32 and 64 bit guests, it is possible to have mmu
> > > reset with same mmu role but different root level (32 bit vs 64 bit paging)

Hmm, no, it's far more nuanced than just "running a mix of 32 and 64 bit guests".
The changelog needs a much more in-depth explanation of exactly what and how
things go awry. I suspect that whatever bug is being hit is unique to the
migration path.

This needs a Fixes and Cc: stable@xxxxxxxxxxxxxxx, but which commit is fixed is
TBD.

Simply running 32 and 64 bit guests is not sufficient to cause problems. In
that case, they are different VMs entirely, where as the MMU context is per-vCPU.
So at minimum, this require running a mix of 32-bit and 64-bit _nested_ guests.

But even that isn't sufficient. Ignoring migration for the moment:

a) If shadow paging is enabled in L0, then EFER.LMA is captured in
kvm_mmu_page_role.level. Ergo this flaw doesn't affect legacy shadow paging.

b) If EPT is enabled in L0 _and_ L1, kvm_mmu_page_role.level tracks L1's EPT
level. Ergo, this flaw doesn't affect nested EPT.

c) If NPT is enabled in L0 _and_ L1, then this flaw can does apply as
kvm_mmu_page_role.level tracks L0's NPT level, which does not incorporate
L1's NPT level because of the limitations of NPT. But the cover letter
states these bugs are specific to VMX. I suspect that's incorrect and that
in theory this particular bug does apply to SVM, but that it hasn't been
hit due to not running with nNPT and both 32-bit and 64-bit L2s on a single
vCPU.

d) If TDP is enabled in L0 but _not_ L1, then L1 and L2 share an MMU context,
and that context is guaranteed to be reset on every nested transition due
to kvm_mmu_page_role.guest_mode. Ergo this flaw doesn't affect this combo
since it's impossible to switch between two L2 vCPUs on a single L1 vCPU
without a context reset.

The cover letter says:

The second issue is that L2 IA32_EFER makes L1's mmu be initialized incorrectly
(with PAE paging). This itself isn't an immediate problem as we are going into the L2,
but when we exit it, we don't reset the L1's mmu back to 64 bit mode because,
It so happens that the mmu role doesn't change and the 64 bitness isn't part of the mmu role.

But that should be impossible because of kvm_mmu_page_role.guest_mode. If that
doesn't trigger a reset, then presumably is_guest_mode() is stale?

Regarding (c), I believe this can be hit by running a 32-bit L2 and 64-bit L2 on
SVM with nNPT. I don't think that's a combination I've tested much, if at all.

Regarding (d), I believe the bug can rear its head if guest state is stuffed
via KVM_SET_SREGS{2}. kvm_mmu_reset_context() will re-init the MMU, but it
doesn't purge the previous context. I.e. the assumption that a switch between
"two" L2s can be broken.

> > > Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 14 ++++++++++----
> > > 1 file changed, 10 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 354d2ca92df4d..763867475860f 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4745,7 +4745,10 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
> > > union kvm_mmu_role new_role =
> > > kvm_calc_tdp_mmu_root_page_role(vcpu, &regs, false);
> > >
> > > - if (new_role.as_u64 == context->mmu_role.as_u64)
> > > + u8 new_root_level = role_regs_to_root_level(&regs);
> > > +
> > > + if (new_role.as_u64 == context->mmu_role.as_u64 &&
> > > + context->root_level == new_root_level)
> > > return;
> >
> > role_regs_to_root_level() uses 3 things: CR0.PG, EFER.LMA and CR4.PAE
> > and two of these three are already encoded into extended mmu role
> > (kvm_calc_mmu_role_ext()). Could we achieve the same result by adding
> > EFER.LMA there?
>
> Absolutely. I just wanted your feedback on this to see if there is any reason
> to not do this.

Assuming my assessment above is correct, incorporating EFER.LMA into the
extended role is the correct fix. That will naturally do the right thing for
nested EPT in the sense that kvm_calc_shadow_ept_root_page_role() will ignore
EFER.LMA entirely.

> Also it seems that only basic role is compared here.

No, it's the full role. "as_u64" is the overlay for the combined base+ext.

union kvm_mmu_role {
u64 as_u64;
struct {
union kvm_mmu_page_role base;
union kvm_mmu_extended_role ext;
};
};

> I don't 100% know the reason why we have basic and extended roles - there is a
> comment about basic/extended mmu role to minimize the size of arch.gfn_track,
> but I haven't yet studied in depth why.

The "base" role is used to identify which individual shadow pages are compatible
with the current shadow/TDP paging context. Using TDP as an example, KVM can
reuse SPs at the correct level regardless of guest LA57. The gfn_track thing
tracks ever SP in existence for a given gfn. To minimize the number of possible
SPs, and thus minimize the storage capacity needed for gfn_track, the "base" role
is kept as tiny as possible.

> > > context->mmu_role.as_u64 = new_role.as_u64;
> > > @@ -4757,7 +4760,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
> > > context->get_guest_pgd = get_cr3;
> > > context->get_pdptr = kvm_pdptr_read;
> > > context->inject_page_fault = kvm_inject_page_fault;
> > > - context->root_level = role_regs_to_root_level(&regs);
> > > + context->root_level = new_root_level;
> > >
> > > if (!is_cr0_pg(context))
> > > context->gva_to_gpa = nonpaging_gva_to_gpa;
> > > @@ -4806,7 +4809,10 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
> > > struct kvm_mmu_role_regs *regs,
> > > union kvm_mmu_role new_role)
> > > {
> > > - if (new_role.as_u64 == context->mmu_role.as_u64)
> > > + u8 new_root_level = role_regs_to_root_level(regs);
> > > +
> > > + if (new_role.as_u64 == context->mmu_role.as_u64 &&
> > > + context->root_level == new_root_level)

Doesn't matter if EFER.LMA is added to the role, but this extra check shouldn't
be necessary for shadow paging as LMA is factored into role.base.level in that
case. TDP is problematic because role.base.level holds the TDP root level, which
doesn't depend on guest.EFER.LMA.

Nested EPT is also ok because shadow_root_level == root_level in that case.

> > > return;
> > >
> > > context->mmu_role.as_u64 = new_role.as_u64;
> > > @@ -4817,8 +4823,8 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
> > > paging64_init_context(context);
> > > else
> > > paging32_init_context(context);
> > > - context->root_level = role_regs_to_root_level(regs);
> > >
> > > + context->root_level = new_root_level;
> > > reset_guest_paging_metadata(vcpu, context);
> > > context->shadow_root_level = new_role.base.level;
>
>

Next message: Andy Shevchenko: "Re: [PATCH v2 1/2] device property: Add fwnode_irq_get_byname()"
Previous message: H. Nikolaus Schaller: "Re: [PATCH v2 4/6] mmc: core: Fixup storing of OCR for MMC_QUIRK_NONSTD_SDIO"
In reply to: Maxim Levitsky: "Re: [PATCH 3/3] KVM: x86/mmu: don't skip mmu initialization when mmu root level changes"
Next in thread: Maxim Levitsky: "Re: [PATCH 3/3] KVM: x86/mmu: don't skip mmu initialization when mmu root level changes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]