[RFC PATCH 0/8] Dynamic vcpu priority management in kvm

From: Vineeth Pillai (Google)
Date: Wed Dec 13 2023 - 21:47:38 EST


Double scheduling is a concern on virtualization hosts: the host
schedules vcpus without knowing what the vcpu is running, and the guest
schedules tasks without knowing where its vcpus are physically running.
This causes issues with latency, power consumption, resource
utilization, etc. An ideal solution would be a cooperative scheduling
framework where the guest and host share scheduling-related information
and make educated scheduling decisions to optimally handle the
workloads. As a first step, we are taking a stab at reducing latencies
for latency-sensitive workloads in the guest.

This series of patches implements a framework for dynamically managing
the priority of vcpu threads based on the needs of the workload running
on the vcpu. Latency-sensitive workloads (NMI, IRQ, softirq, critical
sections, RT tasks, etc.) get a boost from the host so as to minimize
latency.

The host can proactively boost a vcpu thread when it has enough
information about what is about to run on the vcpu - for example, when
injecting an interrupt. For the remaining cases, the guest requests a
boost if the vcpu is not already boosted, and subsequently requests an
unboost once the latency-sensitive workload completes.
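
As a rough, illustrative sketch of the host-side path (the helper names,
the boost-state parameter and the use of sched_set_fifo()/
sched_set_normal() below are assumptions for illustration only; the
series itself tracks boost state per vcpu and adds a
sched_setscheduler_pi_nocheck() variant in patch 2 so the policy change
is safe from interrupt context):

#include <linux/sched.h>

/*
 * Hypothetical sketch, not code from this series: promote a vcpu thread
 * to an RT policy when the host is about to inject an interrupt, since
 * the guest will most likely run a latency sensitive path next.
 */
static void vcpu_boost_for_injection(struct task_struct *vcpu_task,
                                     bool *boosted)
{
        if (*boosted)                   /* already boosted, nothing to do */
                return;

        sched_set_fifo(vcpu_task);      /* promote to an RT policy */
        *boosted = true;
}

static void vcpu_unboost(struct task_struct *vcpu_task, bool *boosted)
{
        if (!*boosted)
                return;

        sched_set_normal(vcpu_task, 0); /* back to a normal policy, default nice */
        *boosted = false;
}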

A shared memory region is used to communicate the scheduling
information: the guest shares its need for a priority boost, and the
host shares the boosting status of the vcpu. The guest sets a flag when
it needs a boost and continues running; the host reads the flag on the
next VMEXIT and boosts the vcpu thread. Unboosting is done synchronously
so that host workloads can compete fairly with the guest when it is not
running any latency-sensitive workload.
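
To make the handshake concrete, below is a minimal sketch of the kind of
per-vcpu shared layout and guest-side request path this implies; the
structure, field and helper names are illustrative assumptions, not the
ABI actually defined in this series:

#include <linux/types.h>
#include <linux/compiler.h>

/* Illustrative layout of the per-vcpu shared memory (the real layout
 * and names set up in patch 1 may differ). */
struct pv_sched_shared {
        __u32 boost_requested;  /* written by guest, read by host on VMEXIT */
        __u32 boosted;          /* written by host: current boost status    */
};

/*
 * Guest side: request a boost and keep running; the host picks the flag
 * up on its next VMEXIT.  Unboosting, in contrast, is synchronous so
 * host workloads are not starved while the guest has nothing latency
 * sensitive to run.
 */
static inline void pv_sched_request_boost(struct pv_sched_shared *ps)
{
        if (!READ_ONCE(ps->boosted))
                WRITE_ONCE(ps->boost_requested, 1);
}

Keeping the boost request to a single store in shared memory lets the
guest mark a latency-sensitive section and continue running without
taking an extra VMEXIT on the boost side.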

This RFC is x86 specific. It is mostly feature complete, but more work
needs to be done in the following areas:
- Use of the paravirt ops framework.
- Optimizing critical paths for speed, cache efficiency, etc.
- Extending this idea to share more scheduling information, so that both
guest and host can make better educated scheduling decisions.
- Preventing misuse by rogue/buggy guest kernels.

Tests
------

A real-world workload on ChromeOS shows considerable improvement. Audio
and video applications running on low-end devices experience high
latencies when the system is under load. This patch series helps
mitigate the audio and video glitches caused by scheduling latencies.

Following are the results from the OboeTester app running in an Android
VM on ChromeOS. The app tests for audio glitches; the numbers below are
glitch counts.

--------------------------------------------------------
|             |      Noload       ||       Busy        |
| Buffer Size |-----------------------------------------
|             | Vanilla | Patches || Vanilla | Patches |
--------------------------------------------------------
| 96 (2ms)    |    20   |    4    ||   1365  |    67   |
--------------------------------------------------------
| 256 (4ms)   |     3   |    1    ||    524  |    23   |
--------------------------------------------------------
| 512 (10ms)  |     0   |    0    ||     25  |    24   |
--------------------------------------------------------

Noload: tests run on an idle system
Busy: busy system simulated with the Speedometer benchmark

The test shows a considerable reduction in glitches, especially with
smaller buffer sizes.

The following data was collected from a few micro-benchmark tests.
cyclictest was run in a VM to measure latency with and without the
patches. We also took a baseline with all vcpus statically boosted to
RT (via chrt), to observe the difference between dynamic and static
boosting and their effect on the host as well. cyclictest in the guest
observes the effect of the patches on the guest, and cyclictest on the
host checks whether the patches affect workloads on the host.

cyclictest was run on both the host and the guest.
cyclictest cmdline: "cyclictest -q -D 90s -i 500 -d $INTERVAL"
where the $INTERVAL values used were 500 and 1000 us.

The host is an Intel N4500 (4C/4T). The guest also has 4 vcpus.

In the following tables:
Vanilla: baseline, vanilla kernel
Dynamic: with the patches applied
Static: baseline with all vcpus statically boosted to RT (via chrt)

Idle tests
----------
The host is idle, and cyclictest is run on both the host and the guest.

-----------------------------------------------------------------------
|          |   Avg Latency(us): Guest   ||   Avg Latency(us): Host    |
-----------------------------------------------------------------------
| Interval | Vanilla | Dynamic | Static || Vanilla | Dynamic | Static |
-----------------------------------------------------------------------
|   500    |    9    |    9    |   10   ||    5    |    3    |    3   |
-----------------------------------------------------------------------
|   1000   |   34    |   35    |   35   ||    5    |    3    |    3   |
-----------------------------------------------------------------------

-----------------------------------------------------------------------
|          |   Max Latency(us): Guest   ||   Max Latency(us): Host    |
-----------------------------------------------------------------------
| Interval | Vanilla | Dynamic | Static || Vanilla | Dynamic | Static |
-----------------------------------------------------------------------
|   500    |  1577   |  1433   |  140   ||  1577   |  1526   | 15969  |
-----------------------------------------------------------------------
|   1000   |  6649   |   765   |  204   ||   697   |   174   |  2444  |
-----------------------------------------------------------------------

Busy Tests
----------
Here, a busy host was simulated using stress-ng, and cyclictest was run
on both the host and the guest.

-----------------------------------------------------------------------
|          |   Avg Latency(us): Guest   ||   Avg Latency(us): Host    |
-----------------------------------------------------------------------
| Interval | Vanilla | Dynamic | Static || Vanilla | Dynamic | Static |
-----------------------------------------------------------------------
|   500    |   887   |   21    |   25   ||    6    |    6    |    7   |
-----------------------------------------------------------------------
|   1000   |  6335   |   45    |   38   ||   11    |   11    |   14   |
-----------------------------------------------------------------------

-----------------------------------------------------------------------
|          |   Max Latency(us): Guest   ||   Max Latency(us): Host    |
-----------------------------------------------------------------------
| Interval | Vanilla | Dynamic | Static || Vanilla | Dynamic | Static |
-----------------------------------------------------------------------
|   500    | 216835  |  13978  |  1728  ||  2075   |  2114   |  2447  |
-----------------------------------------------------------------------
|   1000   | 199575  |  70651  |  1537  ||  1886   |  1285   | 27104  |
-----------------------------------------------------------------------

These patches are rebased on 6.5.10.
Patches 1-4: Implementation of the core host-side feature.
Patch 5: A naive throttling mechanism that bounds the boost duration
while the guest has preemption disabled. This is a placeholder for the
throttling mechanism for now and would need to be implemented
differently.
Patch 6: Enable/disable tunables - global and per-VM.
Patches 7-8: Implementation of the core guest-side feature.

---
Vineeth Pillai (Google) (8):
kvm: x86: MSR for setting up scheduler info shared memory
sched/core: sched_setscheduler_pi_nocheck for interrupt context usage
kvm: x86: vcpu boosting/unboosting framework
kvm: x86: boost vcpu threads on latency sensitive paths
kvm: x86: upper bound for preemption based boost duration
kvm: x86: enable/disable global/per-guest vcpu boost feature
sched/core: boost/unboost in guest scheduler
irq: boost/unboost in irq/nmi entry/exit and softirq

arch/x86/Kconfig | 13 +++
arch/x86/include/asm/kvm_host.h | 69 ++++++++++++
arch/x86/include/asm/kvm_para.h | 7 ++
arch/x86/include/uapi/asm/kvm_para.h | 43 ++++++++
arch/x86/kernel/kvm.c | 16 +++
arch/x86/kvm/Kconfig | 12 +++
arch/x86/kvm/cpuid.c | 2 +
arch/x86/kvm/i8259.c | 2 +-
arch/x86/kvm/lapic.c | 8 +-
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/x86.c | 154 +++++++++++++++++++++++++++
include/linux/kvm_host.h | 56 ++++++++++
include/linux/sched.h | 23 ++++
include/uapi/linux/kvm.h | 5 +
kernel/entry/common.c | 39 +++++++
kernel/sched/core.c | 127 +++++++++++++++++++++-
kernel/softirq.c | 11 ++
virt/kvm/kvm_main.c | 150 ++++++++++++++++++++++++++
19 files changed, 730 insertions(+), 11 deletions(-)

--
2.43.0