[RFC V3 0/6] KVM: x86: introduce pv feature lazy tscdeadline

From: Wang Jianchao
Date: Sun Jul 16 2023 - 22:35:45 EST


Hi

This patchset attempts to introduce a new pv feature, lazy tscdeadline.

Before this patchset, every time the guest starts or modifies an hrtimer, it has to write the
tsc deadline msr; a vm-exit occurs and the host arms a hv or sw timer for it.

w: write msr
x: vm-exit
t: hv or sw timer

Guest
w
---------------------------------------> Time
Host x t


However, in workloads that set up timers frequently, the tscdeadline msr is usually overwritten
many times before the timer expires, and every time we modify the tscdeadline, a vm-exit occurs.


1. write to msr with t1

Guest
w1
----------------------------------------> Time
Host x1 t1


2. write to msr with t2
Guest
w2
------------------------------------------> Time
Host x2 t1->t2


3. write to msr with t3
Guest
w3
------------------------------------------> Time
Host x3 t2->t3


4. write to msr with t4
Guest
w4
------------------------------------------> Time
Host x4 t3->t4


What this patchset wants to do is eliminate the vm-exits x2, x3 and x4, as follows.


First, we have two fields shared between guest and host, as with other pv features:
- armed, the value of tscdeadline that has a timer on the host side, only updated by the __host__ side.
Every time the host side arms a timer in tscdeadline mode, it updates @armed.
- pending, the next value of tscdeadline, only updated by the __guest__ side. Every time the guest
invokes kvm_lapic_next_deadline (the lazy_tscdeadline version of the set_next_event callback), it
updates @pending, whether or not it goes on to wrmsrl.

On the guest side, say we want to set tscdeadline to t. We need to update @pending first, then,
- if @armed is zero, or t < @armed, go on to wrmsrl and trap into the host to arm the timer
- if t >= @armed, just return
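
The guest-side rule above can be sketched as a small userspace model. This is only an
illustration, not the code from the patches: the struct and function names are hypothetical
(only the field names @armed and @pending come from this letter), and a C11 seq_cst fence
stands in for the fence the guest issues between updating @pending and reading @armed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of the shared area; only the field names come from the cover letter. */
struct lazy_tscdeadline_shared {
	_Atomic uint64_t armed;   /* written by the host only */
	_Atomic uint64_t pending; /* written by the guest only */
};

/*
 * Guest-side decision for a new deadline t (hypothetical helper).
 * Returns true if the guest still has to write the msr and trap into the
 * host, false if publishing @pending was enough.
 */
static bool guest_needs_wrmsr(struct lazy_tscdeadline_shared *s, uint64_t t)
{
	uint64_t armed;

	assert(s != NULL);

	/* Publish the new deadline first, whether or not we trap. */
	atomic_store_explicit(&s->pending, t, memory_order_relaxed);

	/* Keep the store to @pending ordered before the load of @armed. */
	atomic_thread_fence(memory_order_seq_cst);

	armed = atomic_load_explicit(&s->armed, memory_order_relaxed);

	/* Nothing armed, or the armed timer would fire too late: trap. */
	return armed == 0 || t < armed;
}
```

A caller would do the real msr write only when this helper returns true; otherwise it returns
right after @pending has been updated.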

On the host side, when the armed timer fires,
- if @pending == @armed, inject the local timer interrupt
- if @pending > @armed, just re-arm the timer
- the case @pending < @armed shouldn't happen; the guest side will trap into the host to update
@armed in this case
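
The host-side rule can be modelled in the same spirit. Again this is a hedged sketch with
hypothetical names, not the code from the patches. It is single-threaded, so plain fields are
used, and zeroing @armed after the interrupt is injected is an assumption made here so that the
guest's "@armed is zero" check has something to observe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What the host does when the armed timer fires (names are hypothetical). */
enum host_action {
	HOST_INJECT_TIMER_IRQ,
	HOST_REARM_TIMER,
};

/* Single-threaded model of the two shared fields as seen by the host. */
struct lazy_tscdeadline_host_view {
	uint64_t armed;   /* deadline that currently has a host timer */
	uint64_t pending; /* latest deadline published by the guest */
};

static enum host_action host_timer_fired(struct lazy_tscdeadline_host_view *s)
{
	assert(s != NULL);
	assert(s->armed != 0);           /* a timer must be armed to fire */
	assert(s->pending >= s->armed);  /* guest traps when t < @armed */

	if (s->pending > s->armed) {
		/* The guest moved its deadline later: re-arm, no interrupt. */
		s->armed = s->pending;
		return HOST_REARM_TIMER;
	}

	/* @pending == @armed: this is the deadline the guest is waiting for. */
	s->armed = 0; /* assumption: nothing is armed after the injection */
	return HOST_INJECT_TIMER_IRQ;
}
```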

1. write to msr with t1

armed : t1
pending : t1
Guest
w1
----------------------------------------> Time
Host x1 t1

A vm-exit occurs and the host arms a timer for t1.


2. write to msr with t2

armed : t1
pending : t2

Guest
w2
------------------------------------------> Time
Host t1

The tsc deadline value that has been armed, namely t1, is smaller than t2, so there is no need
to write the msr; just update pending.


3. write to msr with t3

armed : t1
pending : t3

Guest
w3
------------------------------------------> Time
Host t1

Similar to step 2, just update the pending field with t3; no vm-exit.


4. write to msr with t4

armed : t1
pending : t4

Guest
w4
------------------------------------------> Time
Host t1
Similar to step 2, just update the pending field with t4; no vm-exit.


5. t1 expires, arm t4

armed : t4
pending : t4


Guest

------------------------------------------> Time
Host t1 ------> t4

When t1 fires, the host checks the pending field and re-arms a timer based on it.

In this case, the vm-exits caused by writing the tsc deadline msr for t2, t3 and t4
are eliminated. Even though the expiry of t1 causes one more preemption-timer vm-exit,
we still save 2 vm-exits in this case.
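
The arithmetic of this example can be checked with a small counting model. It is a sketch under
stated assumptions, not a measurement: all writes are assumed to happen before the first armed
deadline expires, every host timer expiry is counted as one vm-exit (as with the preemption
timer), and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct exit_counts {
	unsigned int msr_write; /* vm-exits caused by writing the msr */
	unsigned int timer;     /* vm-exits caused by host timer expiry */
};

/* Without the feature: every write traps, only the last deadline fires. */
static struct exit_counts baseline_exits(size_t n)
{
	struct exit_counts c = { (unsigned int)n, n ? 1u : 0u };
	return c;
}

/* With the feature: trap only if nothing is armed or t < @armed. */
static struct exit_counts lazy_exits(const uint64_t *deadlines, size_t n)
{
	struct exit_counts c = { 0, 0 };
	uint64_t armed = 0, pending = 0;
	size_t i;

	assert(deadlines != NULL || n == 0);

	for (i = 0; i < n; i++) {
		pending = deadlines[i];
		if (armed == 0 || pending < armed) {
			c.msr_write++;
			armed = pending;
		}
	}

	/* Let the host timer fire until it has caught up with @pending. */
	while (armed != 0) {
		c.timer++;
		armed = (pending > armed) ? pending : 0;
	}
	return c;
}

static unsigned int total_exits(struct exit_counts c)
{
	return c.msr_write + c.timer;
}
```

For four increasing deadlines t1 < t2 < t3 < t4, the model gives 4 msr-write exits plus 1 timer
exit without the feature, and 1 msr-write exit plus 2 timer exits with it, i.e. a net saving of 2.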

Here are the test results of netperf TCP-RR on loopback:

VM-Exit:              Close        Open
sum                10485133     6177331
halt                2082894     2958096
msr-write           8323993     3140474
preemption-timer      36036       42064
-------------------------------------------
MSR:
sum                 8324075     3140518
apic-icr            2115802     2969154
tsc-deadline        6208273      171364
-------------------------------------------
Interrupts:
236                   44003       55059
251                 2081941     2943361

Note:
- Host kernel is 6.5-rc1
- Guest kernel is 5.14 + patch
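
To put the table in relative terms, the helper below computes the percentage reduction between
the two columns. It is only a convenience sketch: the function name is hypothetical, and it
assumes that "Close" is the run with the feature disabled and "Open" the run with it enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Percentage by which a counter dropped from the "Close" to the "Open" run. */
static double reduction_pct(uint64_t closed, uint64_t open)
{
	assert(closed > 0);
	return 100.0 * ((double)closed - (double)open) / (double)closed;
}
```

With the numbers above, reduction_pct(6208273, 171364) is roughly 97 for tsc-deadline msr writes,
reduction_pct(8323993, 3140474) is roughly 62 for msr-write vm-exits, and
reduction_pct(10485133, 6177331) is roughly 41 for vm-exits overall.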

This patchset includes 6 patches,

The 1st patch, KVM: x86: add msr register and data structure for lazy tscdeadline
Adds the msr register, feature flag and data structure for this new feature. There is
no functional change in this patch.

The 2nd patch, KVM: x86: exchange info about lazy_tscdeadline with msr
Exchange the gpa of the kvm_lazy_tscdeadline data structure between guest and
host.

The 3rd patch, x86/apic: switch set_next_event to lazy tscdeadline version
If lazy_tscdeadline is enabled, switch the set_next_event callback from
lapic_next_deadline to kvm_lapic_next_deadline.

The 4th patch, KVM: x86: do lazy_tscdeadline init and exit
Do the init and exit work for lazy_tscdeadline. It pins the page in which the gpa
of kvm_lazy_tscdeadline is located and maps it into kernel space. The exit path
releases them.

The 5th patch, KVM: X86: add lazy tscdeadline support to reduce vm-exit of msr-write
It introduces the update, kick and clear operations to make lazy_tscdeadline
work on the host side. Refer to the following comment,
- UPDATE, when the guest updates the tsc deadline msr, we need to
update the value of the 'armed' field of kvm_lazy_tscdeadline
- KICK, when the hv or sw timer fires, we need to check the
'pending' field to decide whether to re-arm the timer or inject
the local timer vector. The sw timer is not in vcpu context, so a
new kvm req is added to handle the kick in vcpu context.
- CLEAR, this is a bit tricky. We need to clear the 'armed' field
properly, otherwise the guest OS can hang.

The 6th patch, KVM: x86: add debugfs file for lazy tscdeadline per vcpu
Add a debug entry for this feature.


Changes from V2:
- Comments and chart in cover letter and patches are rewritten
- Move weak_wrmsr_fence after updating @pending to avoid re-ordering of the update of
@pending and the read of @armed
- Split the original 3rd patch into 3 to reduce the size of patches
- Avoid injecting an interrupt into the guest when the lazy tscdeadline timer is kicked
- Add kvm_vcpu_kick() when writing to the lazy_tscdeadline debugfs interface

Changes from V1:
- In the 3rd patch, rename the variable of kvm_host_lazy_tscdeadline from 'host'
to 'hlt'. In addition, add more details to the comment of the patch
- Add a 4th patch which adds a debugfs file for this feature

Any comment is welcome.

Thanks
Jianchao

Wang Jianchao (6)
KVM: x86: add debugfs file for lazy tscdeadline per vcpu
KVM: X86: add lazy tscdeadline support to reduce vm-exit of msr-write
KVM: x86: do lazy_tscdeadline init and exit
x86/apic: switch set_next_event to lazy tscdeadline version
KVM: x86: exchange info about lazy_tscdeadline with msr
KVM: x86: add msr register and data structure for lazy tscdeadline


arch/x86/include/asm/kvm_host.h | 10 ++++++++
arch/x86/kernel/apic/apic.c | 30 +++++++++++++++++++++-
arch/x86/kernel/kvm.c | 13 ++++++++++
arch/x86/kvm/debugfs.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/lapic.c | 138 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
arch/x86/kvm/lapic.h | 4 +++
arch/x86/kvm/x86.c | 27 ++++++++++++++++++++
7 files changed, 291 insertions(+), 11 deletions(-)