Re: [PATCH v4] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS

From: Sean Christopherson
Date: Thu Jun 01 2023 - 13:43:23 EST


On Wed, May 31, 2023, Jon Kohler wrote:
>
> > On May 31, 2023, at 2:30 PM, Jim Mattson <jmattson@xxxxxxxxxx> wrote:
> >
> > On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@xxxxxxxxxxx> wrote:
> >> Yea, I thought about it. One one hand, simplicity is kingand on the other
> >> hand, not having to think about this again is nice too.
> >>
> >> The challenge in my mind is that on setups where this truly is static, we’d
> >> be taking some incremental amount of memory to keep the counter around,

Not really. The vCPU structures are already order-2 allocations, increasing the
size by 8-16 bytes doesn't affect the actual memory usage in practice. Death by
a thousand cuts is a potential problem, but we're a ways away from crossing back
over into order-3 allocations.

> >> just to have the same outcome each time. Doesn’t feel right (to me) unless that is
> >> also used for “other” stuff as some sort of general purpose/common counter.

...

> Yes, there is places this could be stuffed I’m sure. Still feels a bit heavy
> handed for the same-outcome-every-time situations though.

There's no guarantee the outcome will be the same. You're assuming that (a) the
guest is eIBRS aware, (b) SPEC_CTRL doesn't get extended for future mitigations,
and (c) that if L1 is running VMs of its own, that L1 is advertising eIBRS to L2
and that the L2 kernel is also aware of eIBRS.

> >> RE Cost: I can’t put my finger on it, but I swear that RDMSR for *this*
> >> specific MSR is more expensive than any other RDMSR I’ve come across
> >> for run-of-the-mill random MSRs. I flipped thru the SDM and the mitigations
> >> documentation, and it only ever mentions that there is a notable cost to
> >> do WRMSR IA32_SPEC_CTRL, but nothing about the RDMSR side.
> >>
> >> If anyone happens to know from an Intel-internals perspective, I’d be quite
> >> interested to know why it just “feels” so darn costly. i.e. is the proc also doing
> >> special things under the covers, similar to what the processor does on
> >> writes to this one?
> >
> > What do you mean by "feels"? Have you measured it?
>
> There are plenty of rdmsr’s scattered around the entry and exit paths that get
> hit every time, but this is far and away always the most expensive one when
> profiling with perf top. I haven’t measured it separately from the existing code,
> But rather noted during profiling that it appears to be nastier than others.
>
> I’m more curious than anything else, but it doesn’t matter all that much going
> forward since this commit will nuke it from orbit for the run of the mill
> eIBRS-only use cases.

As above, you're making multiple assumptions that may or may not hold true. I
agree with Jim, reacting to what the guest is actually doing is more robust than
assuming the guest will do XYZ based on the vCPU model or some other heuristic.

The code isn't that complex, and KVM can even reuse the number of exits snapshot
to periodically re-enable the intercept, e.g. to avoid unnecessary RDMSRs if the
vCPU stops writing MSR_IA32_SPEC_CTRL for whatever reason.

Needs actual testing and informed magic numbers, but I think this captures the
gist of what Jim is suggesting.

---
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/svm/svm.c | 22 ++++++++--------------
arch/x86/kvm/vmx/vmx.c | 28 ++++++++++------------------
arch/x86/kvm/x86.h | 24 ++++++++++++++++++++++++
4 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fb9d1f2d6136..3fdb6048cd58 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -966,6 +966,9 @@ struct kvm_vcpu_arch {
/* Host CPU on which VM-entry was most recently attempted */
int last_vmentry_cpu;

+ u32 nr_quick_spec_ctrl_writes;
+ u64 spec_ctrl_nr_exits_snapshot;
+
/* AMD MSRC001_0015 Hardware Configuration */
u64 msr_hwcr;

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ca32389f3c36..f749613204d3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2959,21 +2959,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
svm->vmcb->save.spec_ctrl = data;
else
svm->spec_ctrl = data;
- if (!data)
- break;

- /*
- * For non-nested:
- * When it's written (to non-zero) for the first time, pass
- * it through.
- *
- * For nested:
- * The handling of the MSR bitmap for L2 guests is done in
- * nested_svm_vmrun_msrpm.
- * We update the L1 MSR bit as well since it will end up
- * touching the MSR anyway now.
- */
- set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
+ if (!msr->host_initiated &&
+ kvm_account_msr_spec_ctrl_write(vcpu))
+ set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
break;
case MSR_AMD64_VIRT_SPEC_CTRL:
if (!msr->host_initiated &&
@@ -4158,6 +4147,11 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)

svm_complete_interrupts(vcpu);

+ if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL) &&
+ !spec_ctrl_intercepted &&
+ kvm_account_msr_spec_ctrl_passthrough(vcpu))
+ set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 0, 0);
+
if (is_guest_mode(vcpu))
return EXIT_FASTPATH_NONE;

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 44fb619803b8..4f4a2c3507bc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2260,24 +2260,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;

vmx->spec_ctrl = data;
- if (!data)
- break;

- /*
- * For non-nested:
- * When it's written (to non-zero) for the first time, pass
- * it through.
- *
- * For nested:
- * The handling of the MSR bitmap for L2 guests is done in
- * nested_vmx_prepare_msr_bitmap. We should not touch the
- * vmcs02.msr_bitmap here since it gets completely overwritten
- * in the merging. We update the vmcs01 here for L1 as well
- * since it will end up touching the MSR anyway now.
- */
- vmx_disable_intercept_for_msr(vcpu,
- MSR_IA32_SPEC_CTRL,
- MSR_TYPE_RW);
+ if (msr_info->host_initiated &&
+ kvm_account_msr_spec_ctrl_write(vcpu))
+ vmx_disable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL,
+ MSR_TYPE_RW);
break;
case MSR_IA32_TSX_CTRL:
if (!msr_info->host_initiated &&
@@ -7192,6 +7179,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ unsigned int run_flags = __vmx_vcpu_run_flags(vmx);
unsigned long cr3, cr4;

/* Record the guest's net vcpu time for enforced NMI injections. */
@@ -7280,7 +7268,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
kvm_wait_lapic_expire(vcpu);

/* The actual VMENTER/EXIT is in the .noinstr.text section. */
- vmx_vcpu_enter_exit(vcpu, __vmx_vcpu_run_flags(vmx));
+ vmx_vcpu_enter_exit(vcpu, run_flags);

/* All fields are clean at this point */
if (kvm_is_using_evmcs()) {
@@ -7346,6 +7334,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
vmx_recover_nmi_blocking(vmx);
vmx_complete_interrupts(vmx);

+ if ((run_flags & VMX_RUN_SAVE_SPEC_CTRL) &&
+ kvm_account_msr_spec_ctrl_passthrough(vcpu))
+ vmx_enable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
+
if (is_guest_mode(vcpu))
return EXIT_FASTPATH_NONE;

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c544602d07a3..454bcbf5b543 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -492,7 +492,31 @@ static inline void kvm_machine_check(void)

void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
+
int kvm_spec_ctrl_test_value(u64 value);
+
+static inline bool kvm_account_msr_spec_ctrl_write(struct kvm_vcpu *vcpu)
+{
+ if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 20)
+ vcpu->arch.nr_quick_spec_ctrl_writes++;
+ else
+ vcpu->arch.nr_quick_spec_ctrl_writes = 0;
+
+ vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;
+
+ return vcpu->arch.nr_quick_spec_ctrl_writes >= 10;
+}
+
+static inline bool kvm_account_msr_spec_ctrl_passthrough(struct kvm_vcpu *vcpu)
+{
+ if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 100000)
+ return false;
+
+ vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;
+ vcpu->arch.nr_quick_spec_ctrl_writes = 0;
+ return true;
+}
+
bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
struct x86_exception *e);

base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
--