KVM: x86: Reconsider the current approach of vPMU

From: Like Xu
Date: Wed Feb 09 2022 - 03:11:31 EST


Changed the subject to attract more attention.

On 3/2/2022 6:35 am, Jim Mattson wrote:
On Wed, Feb 2, 2022 at 6:43 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

Urgh... hate on kvm being a module again. We really need something like
EXPORT_SYMBOL_KVM() or something.

Perhaps we should reconsider the current approach of treating the
guest as a client of the host perf subsystem via kvm as a proxy. There

The story of vPMU begins with the perf_event_create_kernel_counter()
interface, which is a generic in-kernel API.

are several drawbacks to the current approach:
1) If the guest actually sets the counter mask (and invert counter
mask) or edge detect in the event selector, we ignore it, because we
have no way of requesting that from perf.

We need more guest use cases and voices to decide on vPMU capabilities
case by case (for example, invert counter mask or edge detect).

KVM may set these bits before vm-entry if doing so does not affect the host.
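
For reference, a minimal sketch of the event selector bits under discussion.
The bit positions follow the architectural IA32_PERFEVTSELx layout; the macro
and helper names below are hypothetical, chosen for illustration, and are not
the kernel's own identifiers or KVM's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural IA32_PERFEVTSELx fields relevant to drawback (1). */
#define EVTSEL_EDGE        (1ULL << 18)    /* bit 18: edge detect */
#define EVTSEL_INV         (1ULL << 23)    /* bit 23: invert counter mask */
#define EVTSEL_CMASK_MASK  0xFF000000ULL   /* bits 31:24: counter mask */
#define EVTSEL_CMASK_SHIFT 24

/* Hypothetical helper: extract the counter mask the guest programmed. */
static uint8_t evtsel_cmask(uint64_t evtsel)
{
	return (uint8_t)((evtsel & EVTSEL_CMASK_MASK) >> EVTSEL_CMASK_SHIFT);
}

/*
 * Hypothetical helper: true if the guest asked for edge detect, invert,
 * or a non-zero counter mask, i.e. the filtering described in (1) as
 * ignored today.
 */
static bool evtsel_uses_ignored_bits(uint64_t evtsel)
{
	return (evtsel & (EVTSEL_EDGE | EVTSEL_INV | EVTSEL_CMASK_MASK)) != 0;
}
```

A check like this only identifies the affected guest writes; whether and how
those bits can be applied before vm-entry is the open question above.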

2) If a system-wide pinned counter preempts one of kvm's thread-pinned
counters, we have no way of letting the guest know, because the
architectural specification doesn't allow counters to be suspended.

One such case is the NMI watchdog. The truth is that KVM can check the status
of the event before vm-entry and learn that the back-end counter has been
rescheduled to another perf user, but it can't do anything about it.

I had drafted a vPMU notification patch set to synchronize the status of the
back-end counters to the guest, using a PV method with the help of vPMI.

I'm sceptical about this direction and about the efficiency of the notification
mechanism I designed, but I do hope that others have better ideas and quality code.

Counters are relatively plentiful, but LBR is a pain in the arse. I can post
a slow path with a high performance cost if you're interested.

3) TDX is going to pull the rug out from under us anyway. When the TDX
module usurps control of the PMU, any active host counters are going
to stop counting. We are going to need a way of telling the host perf

I presume that the performance counter data of a TDX guest is isolated from
the host, and that host counters (from the host perf agent) will keep counting
instead of stopping only for TDX guests in debug mode.

Off-topic: not all of the core-PMU capabilities can or should be
used by TDX guests (given that the firmware's handling of PMU resources
is constantly changing and not even defined).

subsystem what's happening, or other host perf clients are going to
get bogus data.

I predict the perf core will be patched to detect this kind of data hole from
TDX, SGX and AMD-SEV (via a callback, KVM notifying perf, smarter perf_event
running-time accounting, or a host stable-TSC diff) and to flag it in the report.
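
As an illustration of the "running time" option, perf already reports
time_enabled and time_running alongside a count when PERF_FORMAT_TOTAL_TIME_ENABLED
and PERF_FORMAT_TOTAL_TIME_RUNNING are requested, and tools use the ratio to
scale multiplexed events. A minimal sketch of the arithmetic follows; the struct
and function names are hypothetical, and whether time_running would reflect the
TDX module taking the PMU is exactly the part that is not settled.

```c
#include <stdint.h>

/* Mirrors the non-group read() layout with both TOTAL_TIME flags set. */
struct counter_reading {
	uint64_t value;        /* raw count */
	uint64_t time_enabled; /* ns the event was enabled */
	uint64_t time_running; /* ns the event was actually counting */
};

/* Nanoseconds during which the event was enabled but not counting. */
static uint64_t counter_hole_ns(const struct counter_reading *r)
{
	return r->time_enabled > r->time_running ?
	       r->time_enabled - r->time_running : 0;
}

/* Extrapolated count, scaled the way perf tools scale multiplexed events. */
static uint64_t counter_scaled(const struct counter_reading *r)
{
	if (r->time_running == 0)
		return 0; /* never ran: nothing to extrapolate from */
	return (uint64_t)((double)r->value * (double)r->time_enabled /
			  (double)r->time_running);
}
```

A non-zero hole tells a reader that the reported value is an estimate, which is
the kind of annotation the report would need for these cases.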


Given what's coming with TDX, I wonder if we should just bite the
bullet and cede the PMU to the guest while it's running, even for
non-TDX guests. That would solve (1) and (2) as well.

The idea of "cede the PMU to the guest" or "vPMU pass-through" is not really
new to us. There are two main obstacles: one is the isolation of NMI paths
(including GLOBAL_* MSRs, like STATUS); the other is preventing the guest from
sniffing host data, which would be a perfect tool for guest-escape attackers.

At one time, we proposed to statically reserve counters from the host
perf view at guest startup, but this option was NAK-ed by PeterZ.

We may have a chance to reserve counters at host startup
until we have a more guest-PMU-friendly hardware design. :D

I'd like to add one more vPMU discussion point: support for
heterogeneous VMX cores (running vPMU on ADL or later).

The current capability of vPMU depends on which physical CPU
the KVM module is initialized on, and the current upstream
solution raises concerns about vCPU compatibility and migration.

Thanks,
Like Xu