[RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM

From: Xiong Zhang
Date: Fri Jan 26 2024 - 04:29:03 EST


Background
===
KVM has supported vPMU for years as the emulated vPMU. In particular, KVM
presents a virtual PMU to guest where accesses to PMU get trapped and
converted into perf events. These perf events get scheduled along with
other perf events at the host level, sharing the HW resource. In the
emulated vPMU design, KVM is a client of the perf subsystem and has no
control of the HW PMU resource at host level.

This emulated vPMU has these drawbacks:
1. Poor performance, guest PMU MSR accessing has VM-exit and some has
expensive host perf API call. Once guest PMU is multiplexing its
counters, KVM will waste majority of time re-creating/starting/releasing
KVM perf events, then the guest perf performance is dropped dramatically.
2. Guest perf events's backend may be swapped out or disabled silently.
This is because host perf scheduler treats KVM perf events and other host
perf events equally, they will contend HW resources. KVM perf events will
be inactive when all HW resources have been owned by host perf events.
But KVM can not notify this backend error into guest, this slient error
is a red flag for vPMU as a production.
3. Hard to add new vPMU features. For each vPMU new feature, KVM needs
to emulate new MSRs, this involves perf and kvm two subsystems, mostly
the vendor specific perf API is added and is hard to accept.

The community has discussed these drawbacks for years and reconsidered
current emulated vPMU [1]. In latest discussion [2], both Perf and KVM
x86 community agreed to try a passthrough vPMU. So we co-work with google
engineers to develop this RFC, currently we implement it on Intel CPU
only, and can add other arch's implementation later.
Complete RFC source code can be found in below link:
https://github.com/googleprodkernel/linux-kvm/tree/passthrough-pmu-rfc

Under passthrough vPMU, VM direct access to all HW PMU general purpose
counters and some of the fixed counters, VM has transparency of x86 PMU
HW. All host perf events using x86 PMU are stopped during VM running, and
are restarted at VM-exit. This has the following benefits:
1. Better performance, when guest access x86 PMU MSRs and rdpmc, no VM-exit
and no host perf API call.
2. Guest perf events exclusively own HW resource during guest running. Host
perf events are stopped and give up HW resource at VM-entry, and restart
runnging after VM-exit.
3. Easier to enable PMU new features. KVM just needs to passthrough new
MSRs and save/restore them at VM-exit and VM-entry, no need to add
perf API.

Note, passthrough vPMU does satisfy the enterprise-level requirement of
secure usage for PMU by intercepting guest access to all event selectors.
But the key problem of passthrough vPMU is that host user loses the
capability to profile guest. If any users want to profile guest from the
host, they should not enable passthrough vPMU mode. Another problem is
the NMI watchdog is not fully functional anymore. Please see design opens
for more details.

Implementation
===
To passthrough host x86 PMU into guest, PMU context switch is mandatory,
this RFC implements this PMU context switch at VM-entry/exit boundary.

At VM-entry:
1. KVM call perf supplied perf_guest_enter() interface, perf stops all
the perf events which use host x86 PMU.
2. KVM call perf supplied perf_guest_switch_to_kvm_pmi_vector() interface,
perf switch PMI vector to a separate kvm_pmi_vector, so that KVM handles
PMI after this point and KVM injects HW PMI into guest.
3. KVM restores guest PMU context.

In order to support KVM PMU filter feature for security, EVENT_SELECT and
FIXED_CTR_CTRL MSRs are intercepted, all other MSRs defined in Architectural
Performance Monitoring spec and rdpmc are passthrough, so guest can access
them without VM exit during guest running, when guest counter overflow
happens, HW PMI is triggered with dedicated kvm_pmi_vector, KVM injects a
virtual PMI into guest through virtual local apic.

At VM-exit:
1. KVM saves and clears guest PMU context.
2. KVM call perf supplied perf_guest_switch_to_host_pmi_vector() interface,
perf switch PMI vector to host NMI, so that host handles PMI after this
point.
3. KVM call perf supplied perf_guest_exit() interface, perf resched all
the perf events, these events stopped at VM-entry will be re-started here.

Design Opens
===
we met some design opens during this POC and seek supporting from
community:

1. host system wide / QEMU events handling during VM running
At VM-entry, all the host perf events which use host x86 PMU will be
stopped. These events with attr.exclude_guest = 1 will be stopped here
and re-started after vm-exit. These events without attr.exclude_guest=1
will be in error state, and they cannot recovery into active state even
if the guest stops running. This impacts host perf a lot and request
host system wide perf events have attr.exclude_guest=1.

This requests QEMU Process's perf event with attr.exclude_guest=1 also.

During VM running, perf event creation for system wide and QEMU
process without attr.exclude_guest=1 fail with -EBUSY.

2. NMI watchdog
the perf event for NMI watchdog is a system wide cpu pinned event, it
will be stopped also during vm running, but it doesn't have
attr.exclude_guest=1, we add it in this RFC. But this still means NMI
watchdog loses function during VM running.

Two candidates exist for replacing perf event of NMI watchdog:
a. Buddy hardlock detector[3] may be not reliable to replace perf event.
b. HPET-based hardlock detector [4] isn't in the upstream kernel.

3. Dedicated kvm_pmi_vector
In emulated vPMU, host PMI handler notify KVM to inject a virtual
PMI into guest when physical PMI belongs to guest counter. If the
same mechanism is used in passthrough vPMU and PMI skid exists
which cause physical PMI belonging to guest happens after VM-exit,
then the host PMI handler couldn't identify this PMI belongs to
host or guest.
So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
has this vector only. The PMI belonging to host still has an NMI
vector.

Without considering PMI skid especially for AMD, the host NMI vector
could be used for guest PMI also, this method is simpler and doesn't
need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we
didn't meet the skid PMI issue on modern Intel processors.

4. per-VM passthrough mode configuration
Current RFC uses a KVM module enable_passthrough_pmu RO parameter,
it decides vPMU is passthrough mode or emulated mode at kvm module
load time.
Do we need the capability of per-VM passthrough mode configuration?
So an admin can launch some non-passthrough VM and profile these
non-passthrough VMs in host, but admin still cannot profile all
the VMs once passthrough VM existence. This means passthrough vPMU
and emulated vPMU mix on one platform, it has challenges to implement.
As the commit message in commit 0011, the main challenge is
passthrough vPMU and emulated vPMU have different vPMU features, this
ends up with two different values for kvm_cap.supported_perf_cap, which
is initialized at module load time. To support it, more refactor is
needed.

Commits construction
===
0000 ~ 0003: Perf extends exclude_guest to stop perf events during
guest running.
0004 ~ 0009: Perf interface for dedicated kvm_pmi_vector.
0010 ~ 0032: all passthrough vPMU with PMU context switch at
VM-entry/exit boundary.
0033 ~ 0037: Intercept EVENT_SELECT and FIXED_CTR_CTRL MSRs for
KVM PMU filter feature.
0038 ~ 0039: Add emulated instructions to guest counter.
0040 ~ 0041: Fixes for passthrough vPMU live migration and Nested VM.

Performance Data
===
Measure method:
First step: guest run workload without perf, and get basic workload score.
Second step: guest run workload with perf commands, and get perf workload
score.
Third step: perf overhead to workload is gotten from (first-second)/first.
Finally: compare perf overhead between emulated vPMU and passthrough vPMU.

Workload: Specint-2017
HW platform: Sapphire rapids, 1 socket, 56 cores, no-SMT
Perf command:
a. basic-sampling: perf record -F 1000 -e 6-instructions -a --overwrite
b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite

Guest performance overhead:
---------------------------------------------------------------------------
| Test case | emulated vPMU | all passthrough | passthrough with |
| | | | event filters |
---------------------------------------------------------------------------
| basic-sampling | 33.62% | 4.24% | 6.21% |
---------------------------------------------------------------------------
| multiplex-sampling | 79.32% | 7.34% | 10.45% |
---------------------------------------------------------------------------
Note: here "passthrough with event filters" means KVM intercepts EVENT_SELECT
and FIXED_CTR_CTRL MSRs to support KVM PMU filter feature for security, this
is current RFC implementation. In order to collect EVENT_SELECT interception
impact, we modified RFC source to passthrough all the MSRs into guest, this
is "all passthrough" in above table.

Conclusion:
1. passthrough vPMU has much better performance than emulated vPMU.
2. Intercept EVENT_SELECT and FIXED_CTR_CTRL MSRs cause 2% overhead.
3. As PMU context switch happens at VM-exit/entry, the more VM-exit,
the more vPMU overhead. This does not only impacts perf, but it also
impacts other benchmarks which have massive VM-exit like fio. We will
optimize this at the second phase of passthrough vPMU.

Remain Works
===
1. To reduce passthrough vPMU overhead, optimize the PMU context switch.
2. Add more PMU features like LBR, PEBS, perf metrics.
3. vPMU live migration.

Reference
===
1. https://lore.kernel.org/lkml/2db2ebbe-e552-b974-fc77-870d958465ba@xxxxxxxxx/
2. https://lkml.kernel.org/kvm/ZRRl6y1GL-7RM63x@xxxxxxxxxx/
3. https://lwn.net/Articles/932497/
4. https://lwn.net/Articles/924927/

Dapeng Mi (4):
x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU
KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
KVM: x86/pmu: Clear PERF_METRICS MSR for guest

Kan Liang (2):
perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
perf: Support guest enter/exit interfaces

Mingwei Zhang (22):
perf: core/x86: Forbid PMI handler when guest own PMU
perf: core/x86: Plumb passthrough PMU capability from x86_pmu to
x86_pmu_cap
KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and
propage to KVM instance
KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs
KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
KVM: x86/pmu: Allow RDPMC pass through
KVM: x86/pmu: Create a function prototype to disable MSR interception
KVM: x86/pmu: Implement pmu function for Intel CPU to disable MSR
interception
KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with
perf capabilities
KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU
KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU
context
KVM: x86/pmu: Introduce function prototype for Intel CPU to
save/restore PMU context
KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid
information leakage
KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU
capability
KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
KVM: x86/pmu: Make check_pmu_event_filter() an exported function
KVM: x86/pmu: Allow writing to event selector for GP counters if event
is allowed
KVM: x86/pmu: Allow writing to fixed counter selector if counter is
exposed
KVM: x86/pmu: Introduce PMU helper to increment counter
KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from
non-passthrough vPMU
KVM: nVMX: Add nested virtualization support for passthrough PMU

Xiong Zhang (13):
perf: Set exclude_guest onto nmi_watchdog
perf: core/x86: Add support to register a new vector for PMI handling
KVM: x86/pmu: Register PMI handler for passthrough PMU
perf: x86: Add function to switch PMI handler
perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW
KVM: x86/pmu: Add get virtual LVTPC_MASK bit function
KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
KVM: x86/pmu: Intercept EVENT_SELECT MSR
KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR

arch/x86/events/core.c | 38 +++++
arch/x86/events/intel/core.c | 8 +
arch/x86/events/perf_event.h | 1 +
arch/x86/include/asm/hardirq.h | 1 +
arch/x86/include/asm/idtentry.h | 1 +
arch/x86/include/asm/irq.h | 1 +
arch/x86/include/asm/irq_vectors.h | 2 +-
arch/x86/include/asm/kvm-x86-pmu-ops.h | 3 +
arch/x86/include/asm/kvm_host.h | 8 +
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/perf_event.h | 4 +
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kernel/idt.c | 1 +
arch/x86/kernel/irq.c | 29 ++++
arch/x86/kvm/cpuid.c | 4 +
arch/x86/kvm/lapic.h | 5 +
arch/x86/kvm/pmu.c | 102 ++++++++++++-
arch/x86/kvm/pmu.h | 37 ++++-
arch/x86/kvm/vmx/capabilities.h | 1 +
arch/x86/kvm/vmx/nested.c | 52 +++++++
arch/x86/kvm/vmx/pmu_intel.c | 186 +++++++++++++++++++++--
arch/x86/kvm/vmx/vmx.c | 176 +++++++++++++++++----
arch/x86/kvm/vmx/vmx.h | 3 +-
arch/x86/kvm/x86.c | 37 ++++-
arch/x86/kvm/x86.h | 2 +
include/linux/perf_event.h | 11 ++
kernel/events/core.c | 179 ++++++++++++++++++++++
kernel/watchdog_perf.c | 1 +
tools/arch/x86/include/asm/irq_vectors.h | 1 +
29 files changed, 852 insertions(+), 44 deletions(-)


base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
--
2.34.1