Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

From: Sean Christopherson
Date: Wed Oct 04 2023 - 17:50:52 EST


On Wed, Oct 04, 2023, Mingwei Zhang wrote:
> On Wed, Oct 4, 2023 at 4:22 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > > Or are you talking about the whole of vcpu_run() ? That seems like a
> > > > massive amount of code, and doesn't look like anything I'd call a
> > > > fast-path. Also, much of that loop has preemption enabled...
> > >
> > > The whole of vcpu_run(). And yes, much of it runs with preemption enabled. KVM
> > > uses preempt notifiers to context switch state if the vCPU task is scheduled
> > > out/in, we'd use those hooks to swap PMU state.
> > >
> > > Jumping back to the exception analogy, not all exits are equal. For "simple" exits
> > > that KVM can handle internally, the roundtrip is <1500. The exit_fastpath loop is
> > > roughly half that.
> > >
> > > But for exits that are more complex, e.g. if the guest hits the equivalent of a
> > > page fault, the cost of handling the page fault can vary significantly. It might
> > > be <1500, but it might also be 10x that if handling the page fault requires faulting
> > > in a new page in the host.
> > >
> > > We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
> > > because doing too much work with IRQs disabled can cause latency problems for the
> > > host. This isn't much of a concern for slice-of-hardware setups, but would be
> > > quite problematic for other use cases.
> > >
> > > And except for obviously slow paths (from the guest's perspective), extra latency
> > > on any exit can be problematic. E.g. even if we got to the point where KVM handles
> > > 99% of exits the fastpath (may or may not be feasible), a not-fastpath exit at an
> > > inopportune time could throw off the guest's profiling results, introduce unacceptable
> > > jitter, etc.
> >
> > I'm confused... the PMU must not be running after vm-exit. It must not
> > be able to profile the host. So what jitter are you talking about?
> >
> > Even if we persist the MSR contents, the PMU itself must be disabled on
> > vm-exit and enabled on vm-enter. If not by hardware then by software
> > poking at the global ctrl msr.
> >
> > I also don't buy the latency argument, we already do full and complete
> > PMU rewrites with IRQs disabled in the context switch path. And as
> > mentioned elsewhere, the whole AMX thing has an 8k copy stuck in the FPU
> > save/restore.
> >
> > I would much prefer we keep the PMU swizzle inside the IRQ disabled
> > region of vcpu_enter_guest(). That's already a ton better than you have
> > today.

...

> Peter, that latency argument in pass-through implementation is
> something that we hope you could buy. This should be relatively easy
> to prove. I can provide some data if you need.

You and Peter are talking about two different latencies. Or rather, how the
latency impacts two different things.

Peter is talking about the latency impact on the host, specifically how much
work is done with IRQs disabled.

You are talking about the latency impact on the guest, i.e. how much guest
performance is affected if KVM swaps MSRs on every exit.

Peter is contending that swapping PMU MSRs with IRQs disabled isn't a big deal,
because the kernel already does as much during a context switch. I agree, *if*
we're talking about only adding the PMU MSRs.

You (and I) are contending that the latency impact on the guest will be too high
if KVM swaps in the inner VM-Exit loop. This is not analogous to host context
switches, as VM-Exits can occur at a much higher frequency than context switches,
and can be triggered by events that have nothing to do with the guest.
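(For concreteness, the state being swapped on each exit amounts to something like
the below on Intel. Illustrative only: "struct pmu_regs" and its fields are made
up, and fixed counters, PEBS, LBRs, etc. are elided.)

struct pmu_regs {
	u64 global_ctrl;
	u64 eventsel[INTEL_PMC_MAX_GENERIC];
	u64 pmc[INTEL_PMC_MAX_GENERIC];
};

static void swap_gp_counters(struct pmu_regs *save, const struct pmu_regs *load,
			     int nr_gp_counters)
{
	int i;

	/* Quiesce the PMU before touching individual counters. */
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);

	for (i = 0; i < nr_gp_counters; i++) {
		rdmsrl(MSR_P6_EVNTSEL0 + i, save->eventsel[i]);
		rdmsrl(MSR_IA32_PMC0 + i, save->pmc[i]);
		wrmsrl(MSR_P6_EVNTSEL0 + i, load->eventsel[i]);
		/* Assumes full-width counter writes are supported. */
		wrmsrl(MSR_IA32_PMC0 + i, load->pmc[i]);
	}

	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, load->global_ctrl);
}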

There's some confusion here though because of what I said earlier:

  We don't want to get too aggressive with moving stuff into the exit_fastpath
  loop, because doing too much work with IRQs disabled can cause latency problems
  for the host.

By "stuff" I wasn't talking about PMU MSRs, I was referring to all exit handling
that KVM *could* move into the IRQs disabled section in order to mitigate the
concerns that we have about the latency impacts on the guest. E.g. if most exits
are handled in the IRQs disabled section, then KVM could handle most exits without
swapping PMU state and thus limit the impact on guest performance, and not cause
to much host perf "downtime" that I mentioned in the other thread[*].

However, my concern is that handling most exits with IRQs disabled would result
in KVM doing too much work with IRQs disabled, i.e. would impact the host latency
that Peter is talking about. And I'm more than a bit terrified of calling into
code that doesn't expect to be called with IRQs disabled.

Thinking about this more, what if we do a blend of KVM's FPU swapping and debug
register swapping?

A. Load guest PMU state in vcpu_enter_guest() after IRQs are disabled
B. Put guest PMU state (and load host state) in vcpu_enter_guest() before IRQs
are enabled, *if and only if* the current CPU has one or more perf events that
want to use the hardware PMU
C. Put guest PMU state at vcpu_put()
D. Add a perf callback that is invoked from IRQ context when perf wants to
configure a new PMU-based event, *before* actually programming the MSRs,
and have KVM's callback put the guest PMU state

If there are host perf events that want to use the PMU, then KVM will swap fairly
aggressively and the "downtime" of the host perf events will be limited to the
small window around VM-Enter/VM-Exit.
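
Note that the "does this CPU have host perf events" query is the load-bearing
piece for (B), and as far as I know perf doesn't export such a helper today, so
the perf_has_hw_events() used in the sketch at the bottom would be new. A possible
shape (name and placement are my assumptions), keyed off the x86 cpu_hw_events
bookkeeping:

/*
 * Hypothetical, not an existing perf API: returns true if the current CPU
 * has host perf events scheduled on the hardware PMU.  Sketched as if it
 * lived in arch/x86/events/core.c, where cpu_hw_events is visible and
 * n_events counts the events currently scheduled on this CPU.
 */
bool perf_has_hw_events(void)
{
	return this_cpu_read(cpu_hw_events.n_events) != 0;
}
EXPORT_SYMBOL_GPL(perf_has_hw_events);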

If there are no such host events, KVM will swap on the first entry to the guest,
and keep the guest PMU loaded until the vCPU is put.

The perf callback in (D) would allow perf to program system-wide events on all
CPUs without clobbering guest PMU state.
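
Very rough shape of what I'm thinking for (D), all of it hypothetical: perf would
grow a hook (e.g. a new member in struct perf_guest_info_callbacks) that it
invokes with IRQs disabled before programming hardware events on a CPU, and KVM's
load/put helpers would maintain a per-CPU pointer to the vCPU whose PMU state is
currently loaded:

/* Set by kvm_load_guest_pmu(), cleared by kvm_put_guest_pmu(). */
static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_loaded_pmu_vcpu);

/* Invoked by perf, from IRQ context, before it programs hardware events. */
static void kvm_yield_guest_pmu(void)
{
	struct kvm_vcpu *vcpu = __this_cpu_read(kvm_loaded_pmu_vcpu);

	lockdep_assert_irqs_disabled();

	if (vcpu)
		kvm_put_guest_pmu(vcpu);
}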

I think that would make everyone happy. As long as our hosts don't create perf
events, we get the "swap as little as possible" behavior without significantly
impacting the host's ability to utilize perf. If our host screws up and creates
perf events on CPUs that are running vCPUs, then the degraded vCPU performance is
on us.

Rough sketch below, minus the perf callback and any of the actual swapping logic.

[*] https://lore.kernel.org/all/ZR3Ohk50rSofAnSL@xxxxxxxxxx

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7d9ba301c090..86699d310224 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -41,6 +41,32 @@ struct kvm_pmu_ops {
 
 void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);
 
+static inline void kvm_load_guest_pmu(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	lockdep_assert_irqs_disabled();
+
+	if (pmu->guest_state_loaded)
+		return;
+
+	<swap state>
+	pmu->guest_state_loaded = true;
+}
+
+static inline void kvm_put_guest_pmu(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	lockdep_assert_irqs_disabled();
+
+	if (!pmu->guest_state_loaded)
+		return;
+
+	<swap state>
+	pmu->guest_state_loaded = false;
+}
+
 static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
 {
 	/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e645f5b1e2c..93a8f268c37b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4903,8 +4903,20 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+	unsigned long flags;
 	int idx;
 
+	/*
+	 * This can get false positives, but not false negatives, i.e. KVM will
+	 * never fail to put the PMU, but may unnecessarily disable IRQs to
+	 * safely check if the PMU is still loaded.
+	 */
+	if (kvm_is_guest_pmu_loaded(vcpu)) {
+		local_irq_save(flags);
+		kvm_put_guest_pmu(vcpu);
+		local_irq_restore(flags);
+	}
+
 	if (vcpu->preempted) {
 		if (!vcpu->arch.guest_state_protected)
 			vcpu->arch.preempted_in_kernel = !static_call(kvm_x86_get_cpl)(vcpu);
@@ -10759,6 +10771,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		set_debugreg(0, 7);
 	}
 
+	kvm_load_guest_pmu(vcpu);
+
 	guest_timing_enter_irqoff();
 
 	for (;;) {
@@ -10810,6 +10824,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
+	if (perf_has_hw_events())
+		kvm_put_guest_pmu(vcpu);
+
 	vcpu->arch.last_vmentry_cpu = vcpu->cpu;
 	vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());