Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state

From: Sean Christopherson
Date: Fri May 26 2023 - 12:10:49 EST


On Fri, May 26, 2023, Yan Zhao wrote:
> On Thu, May 25, 2023 at 07:53:41AM -0700, Sean Christopherson wrote:
> > > > > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > > > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > > > > and will zap when CR0.CD is cleared. And to avoid regressing the CR0.CD case,
> > > > > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > > > > zap non-WB MTRR ranges when CR0.CD is cleared. Since WB is weak, looking for
> > > Not only non-WB ranges are required to be zapped.
> > > Think about a scenario:
> > > (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
> > > (2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
> > > (3) then guest performs MTRR disable, no zap too.
> > > (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> > > (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.
> >
> > KVM installs WB memtype when the quirk is enabled, thus no need to zap. KVM
> > currently zaps everything when the quirk is disabled, and I'm not proposing that
> > be changed.
> I didn't explain it clearly.
>
> (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too. ==> EPT entry has been created here
> (2) after CR0.CD=1, because of the quirk, no zap, the created EPT entry memtype is still UC.
> (3) then guest performs MTRR disable, no zap too, according to our change.
> (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.==>we also need to zap WB range.

Ugh, right. But that case can be handled by zapping non-WB ranges on CR0.CD being
set. Hmm, and KVM would need to zap everything if CR0.CD is toggled with MTRRs
disabled, but I don't think OVMF ever does that. Zapping on CR0.CD toggling would
would likely introduce a small performance hit for SeaBIOS due to SeaBIOS clearing
CR0.CD immediately after SIPI, i.e. with MTRRs disabled, but that's arguably a
correctness fix since the quirk means KVM incorrectly runs the vCPU with WB SPTEs
until MTRRs are programmed.

With precise+batched zapping, zapping non-WB ranges even when CR0.CD is set should
still be a healthy performance boost for OVMF.

> > It doesn't actually require non-coherent DMA within the CR0.CD=1 window though.
> If we don't need to mind non-coherent DMA within the window CR0.CD=1 to CR0.CD=0,
> do you think it's a good idea to do in this way...
>
> (1) when CR0.CD=1, return guest mtrr type.
>
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7532,12 +7532,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>
> if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> - if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> - cache = MTRR_TYPE_WRBACK;
> - else
> + if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
> + cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
> + return cache << VMX_EPT_MT_EPTE_SHIFT;
> + } else {
> cache = MTRR_TYPE_UNCACHABLE;
> -
> - return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> + return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> + }
> }
>
>
> (2) return MTRR_TYPE_WRBACK if guest mtrr has not been enabled for once
> u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
> @@ -628,13 +635,23 @@ u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
> struct mtrr_iter iter;
> u64 start, end;
> int type = -1;
> const int wt_wb_mask = (1 << MTRR_TYPE_WRBACK)
> | (1 << MTRR_TYPE_WRTHROUGH);
>
> + if (!mtrr_state->enabled_once)
> + return MTRR_TYPE_WRBACK;

No, because this assumes too many things about the guest, and completely falls
apart if the guest reboots.

> (3) when CR0.CD = 1 and the quirk is on, return MTRR type as if MTRR is enabled
> + ignore_disable = kvm_read_cr0_bits(vcpu, X86_CR0_CD) &&
> + kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED);
>
> -static void mtrr_lookup_start(struct mtrr_iter *iter)
> +static void mtrr_lookup_start(struct mtrr_iter *iter, bool ignore_disable)
> {
> - if (!mtrr_is_enabled(iter->mtrr_state)) {
> + if (!ignore_disable && !mtrr_is_enabled(iter->mtrr_state)) {
> iter->mtrr_disabled = true;
> return;
> }
>
> (4) Then we only need to do EPT zap when MTRR is enabled for the first time
> and when MTRR fixed/var entries are changed at enabling MTRR or at CR0.CD=0
> (I prefer at enabling MTRR, as seabios may do the MTRR disable/update/enable when
> CR0.CD=0)
>
>
> Besides, accoding to the commit message introducing KVM_QUIRK_CD_NW_CLEARED,

I am not willing to lean on the historical commit messages for the quirk to
justify any change. I'm not at all convinced that there was any meaningful thought
put into ensuring correctness.

> we can return MTRR_TYPE_WRBACK for CR0.CD=1 only when MTRR is not enbled for
> once. (And I tried, it works!)

On your system, for your setup. The quirk terrifies me because it likely affects
every KVM-based VM out there (I can't find a single VMM that disables the quirk).
These changes are limited to VMs with noncoherent DMA, but still.

In short, I am steadfastly against any change that piles more arbitrary behavior
functional behavior on top of the quirk, especially when that behavior relies on
heuristics to try and guess what the guest is doing.