Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

From: Sean Christopherson
Date: Wed Oct 04 2023 - 16:43:56 EST


On Tue, Oct 03, 2023, Mingwei Zhang wrote:
> On Mon, Oct 2, 2023 at 5:56 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > The "when" is what's important. If KVM took a literal interpretation of
> > "exclude guest" for pass-through MSRs, then KVM would context switch all those
> > MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> > reschedule IRQ to schedule in a different task (or vCPU). The overhead to save
> > all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> > VM-Exit would be a non-starter. E.g. simple VM-Exits are completely handled in
> > <1500 cycles, and "fastpath" exits are something like half that. Switching all
> > the MSRs is likely 1000+ cycles, if not double that.
>
> Hi Sean,
>
> Sorry, I don't mean to interrupt the conversation, but this is slightly
> confusing to me.
>
> I remember that when doing AMX, we added a gigantic 8KB buffer to the
> FPU context switch state. That works well in Linux today. Why can't we
> do the same for the PMU, assuming we context switch all the counters,
> selectors, and global state there?

That's what we (Google folks) are proposing. However, context switching the
PMU outside of vcpu_run() has significant side effects that the FPU doesn't
suffer.

Keeping the guest FPU resident for the duration of vcpu_run() is, in terms of
functionality, completely transparent to the rest of the kernel. From the kernel's
perspective, the guest FPU is just a variation of a userspace FPU, and the kernel
is already designed to save/restore userspace/guest FPU state when the kernel wants
to use the FPU for whatever reason. And crucially, kernel FPU usage is explicit
and contained, e.g. see kernel_fpu_{begin,end}(), and comes with mechanisms for
KVM to detect when the guest FPU needs to be reloaded (see TIF_NEED_FPU_LOAD).
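
For what it's worth, the "explicit and contained" part looks roughly like
this; the copy helper below is made up purely for illustration, only
kernel_fpu_begin()/kernel_fpu_end() are the real API:

#include <linux/types.h>
#include <asm/fpu/api.h>

/* Hypothetical helper, for illustration only. */
static void simd_copy(void *dst, const void *src, size_t len)
{
	/*
	 * Claim the FPU: this saves the current user/guest FPU state if
	 * needed, disables preemption, and flags the state as needing a
	 * reload, so nothing else ever sees the clobbered registers.
	 */
	kernel_fpu_begin();

	/* ... SSE/AVX accelerated copy would go here ... */

	/*
	 * Done with the FPU.  The saved user/guest state is reloaded
	 * lazily, e.g. KVM checks TIF_NEED_FPU_LOAD and restores the
	 * guest FPU before the next VM-Enter.
	 */
	kernel_fpu_end();
}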

The PMU is a completely different story. PMU usage, a.k.a. perf, by design is
"always running". KVM can't transparently stop host usage of the PMU, as disabling
host PMU usage stops perf events from counting/profiling whatever it is they're
supposed to profile.
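
As a concrete (userspace) illustration of "always running", consider a plain
system-wide counter; this isn't from the thread, it's just the stock
perf_event_open() pattern (needs the usual perf privileges for a cpu-wide
event):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct perf_event_attr attr = { 0 };
	long long cycles;
	int fd;

	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CPU_CYCLES;

	/* pid == -1, cpu == 0: count everything that runs on CPU0. */
	fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0)
		return 1;

	sleep(1);

	/* The owner of this event expects it counted the whole time. */
	if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles))
		return 1;

	printf("CPU0 cycles: %lld\n", cycles);
	return 0;
}

If KVM held the PMU for the full duration of vcpu_run(), an event like that
would silently go dark for however long the vCPU stays in the run loop.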

Today, KVM minimizes the "downtime" of host PMU usage by context switching PMU
state at VM-Enter and VM-Exit, or at least as close as possible, e.g. for LBRs
and Intel PT.
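
In other words, the current flow is conceptually something like the sketch
below; the helper names are hypothetical, not actual KVM symbols:

static void vcpu_enter_guest_sketch(struct kvm_vcpu *vcpu)
{
	/*
	 * Last stop before entering the guest: load guest PMU/LBR MSRs,
	 * so host perf only loses the hardware while the guest runs.
	 */
	load_guest_pmu_state(vcpu);		/* hypothetical */

	vmenter_and_run_guest(vcpu);		/* hypothetical VMLAUNCH/VMRESUME */

	/*
	 * First stop after VM-Exit: hand the hardware straight back to
	 * host perf before doing any real exit handling.
	 */
	load_host_pmu_state(vcpu);		/* hypothetical */
}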

What we are proposing would *significantly* increase the downtime, to the
point where it would be almost unbounded on some paths, e.g. if KVM faults
in a page, gup() could end up swapping memory in from disk, installing PTEs,
and so on and so forth. Anyone on the host trying to profile something
related to swap or memory management would be out of luck.