Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

From: Sean Christopherson
Date: Mon May 22 2023 - 13:44:00 EST


On Fri, May 19, 2023, Pawan Gupta wrote:
> On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> > I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
> >
> > 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
> > buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
> > entry+exit.
>
> Unnecessary VERWs in guest will have much higher impact than due to MSR
> read/write at vmentry/exit.

Can you provide numbers for something closeish to a real world workload?

> On an Icelake system it is pointless for a guest to incur VERW penalty when
> the system is not affected by MDS/TAA and guests don't need mitigation for
> MMIO Stale Data. MSR writes are only done when the guest is likely to execute
> unnecessary VERWs(e.g. when the guest thinks its running on an older gen
> CPU).
>
> > KVM just needs to context switch the MSR between guests since the value that's
> > loaded while running in the host is irrelevant. E.g. use a percpu cache to
>
> I will be happy to avoid the MSR read/write, but its worth considering
> that this MSR can receive more bits that host may want to toggle, then
> percpu cache implementation would likely change.

Change in and of itself isn't problematic, so long as whatever code we write won't
fall over if/when new bits are added, i.e. doesn't clobber unknown bits.

> > 5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
> > i.e. the host's desired value is effectively static post-boot, and barring
> > a buggy configuration (running KVM as a guest), the boot CPU's value will be
> > the same as every other CPU.
>
> Would the MSR value be same on every CPU, if only some guests have
> enumerated FB_CLEAR and others haven't?

Ignore the guest, I'm talking purely about the host. Specifically, there's no
reason to do a RDMSR to get the host value on every VM-Enter since the host's
value is effectively static post-boot.

> MSR writes (to disable FB_CLEAR) are not done when a guest enumerates
> FB_CLEAR. Enumeration of FB_CLEAR in guest will depend on its configuration.
>
> > 6. Performance aside, KVM should not be speculating (ha!) on what the guest
> > will and will not do, and should instead honor whatever behavior is presented
> > to the guest. If the guest CPU model indicates that VERW flushes buffers,
> > then KVM damn well needs to let VERW flush buffers.
>
> The current implementation allows guests to have VERW flush buffers when
> they enumerate FB_CLEAR. It only restricts the flush behavior when the
> guest is trying to mitigate against a vulnerability(like MDS) on a
> hardware that is not affected. I guess its common for guests to be
> running with older gen configuration on a newer hardware.

Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest
the guest will do things a certain way and should instead honor the "architectural"
definition, in quotes because I realize there probably is no architectural
definition for any of this.

It might be that the code does (unintentionally?) honor the "architecture", i.e.
this code might actually be accurrate with respect to when the guest can expect
VERW to flush buffers. But the comment is so, so wrong.

/*
* If guest will not execute VERW, there is no need to set FB_CLEAR_DIS
* at VMEntry. Skip the MSR read/write when a guest has no use case to
* execute VERW.
*/
if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) ||
((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) &&
(vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) &&
(vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) &&
(vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) &&
(vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO)))
vmx->disable_fb_clear = false;