[RFC V3 3/6] x86/apic: switch set_next_event to lazy tscdeadline version

From: Wang Jianchao
Date: Sun Jul 16 2023 - 22:36:01 EST


This is the guest-side code of lazy tscdeadline. If CPUID reports
that lazy tscdeadline is enabled, switch .set_next_event to the
lazy tscdeadline version. The core idea is explained below.

Every time the guest starts or modifies an hrtimer, it has to write
the tscdeadline msr; a vm-exit occurs and the host arms a hv or sw
timer for it. However, in workloads that set up timers frequently,
the tscdeadline msr is usually overwritten many times before the
timer expires.

w: write msr, x: vm-exit, t: hv or sw timer

1. write to msr with t1
Guest
          w1
----------------------------------------> Time
Host      x1            t1
...

n. write to msr with tn
Guest
                  wn
------------------------------------------> Time
Host              xn    tn-1 -> tn

This patch aims to eliminate the vm-exits x2 ... xn.

First, two fields are shared between guest and host, as with other
pv features:
- armed: the tscdeadline value that currently has a timer on the
  host side; only updated by the HOST
- pending: the next tscdeadline value; only updated by the GUEST

1. write to msr with t1
armed : t1    pending : t1
Guest
          w1
----------------------------------------> Time
Host      x1            t1

A vm-exit occurs and the host arms a timer for t1.

2. write to msr with t2
armed : t1    pending : t2
Guest
              w2
------------------------------------------> Time
Host                    t1

The tscdeadline value that has been armed, namely t1, is smaller
than t2, so there is no need to write the msr; just update pending
to t2.
...

n. write to msr with tn
armed : t1    pending : tn
Guest
                  wn
------------------------------------------> Time
Host                    t1

Similar to step 2: just update the pending field to tn; no vm-exit
occurs.

n+1. t1 expires, arm tn
armed : tn    pending : tn
Guest

------------------------------------------> Time
Host                    t1 ------> tn

When the guest updates the tscdeadline, if the 'armed' field is
non-zero and not larger than the new deadline, the host already has
a timer armed that fires no later than the new one, so the msr write
is skipped.
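
The steps above can be modelled in user space to count how many
vm-exits the lazy scheme takes. This is only a single-threaded sketch,
not the kernel code: the struct name, the vm_exits counter and the
host_* helpers are hypothetical stand-ins, and the expiry handling on
the host side (including clearing armed when nothing later is pending)
is an assumption based on step n+1. The guest-side check mirrors the
condition used in kvm_lapic_next_deadline().

```c
#include <stdint.h>

/* Hypothetical stand-in for the area shared between guest and host. */
struct lazy_tscdeadline {
	uint64_t armed;   /* deadline with a host timer; written by the host only */
	uint64_t pending; /* latest requested deadline; written by the guest only */
};

/* Number of simulated msr writes that reached the host (vm-exits). */
static unsigned int vm_exits;

/* Simulated host: an msr write traps, and the host arms a timer for it. */
static void host_msr_write(struct lazy_tscdeadline *s, uint64_t deadline)
{
	vm_exits++;
	s->armed = deadline;
}

/*
 * Simulated guest: record the new deadline in pending, and write the msr
 * only if no timer is armed or the new deadline is earlier than the armed
 * one. The real code needs a fence between the store to pending and the
 * load of armed; this single-threaded model does not.
 */
static void guest_set_deadline(struct lazy_tscdeadline *s, uint64_t deadline)
{
	s->pending = deadline;
	if (!s->armed || deadline < s->armed)
		host_msr_write(s, deadline);
}

/*
 * Simulated host: the armed timer expired (step n+1). If the guest asked
 * for a later deadline in the meantime, re-arm for it without a vm-exit.
 * Clearing armed otherwise is an assumption of this sketch.
 */
static void host_timer_expired(struct lazy_tscdeadline *s)
{
	if (s->pending > s->armed)
		s->armed = s->pending;
	else
		s->armed = 0;
}
```

Starting from a zeroed struct, calling guest_set_deadline() with
increasing deadlines t1 < t2 < ... < tn takes a single simulated
vm-exit, for t1; host_timer_expired() then moves armed to tn. A later
request that is earlier than the armed value takes the msr path again.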

Signed-off-by: Li Shujin <arkinjob@xxxxxxxxxxx>
Signed-off-by: Wang Jianchao <jianchwa@xxxxxxxxxxx>
---
arch/x86/kernel/apic/apic.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index af49e24..5aea74f 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -62,6 +62,9 @@
#include <asm/intel-family.h>
#include <asm/irq_regs.h>
#include <asm/cpu.h>
+#include <linux/kvm_para.h>
+
+DECLARE_PER_CPU_DECRYPTED(struct kvm_lazy_tscdeadline, kvm_lazy_tscdeadline);

unsigned int num_processors;

@@ -495,6 +498,26 @@ static int lapic_next_deadline(unsigned long delta,
return 0;
}

+static int kvm_lapic_next_deadline(unsigned long delta,
+ struct clock_event_device *evt)
+{
+ struct kvm_lazy_tscdeadline *lazy_tscddl = this_cpu_ptr(&kvm_lazy_tscdeadline);
+ u64 tsc;
+
+ tsc = rdtsc() + (((u64) delta) * TSC_DIVISOR);
+ lazy_tscddl->pending = tsc;
+ /*
+ * The fence serves two purposes:
+ * - keep the wrmsrl from being reordered
+ * - order the write to pending before the read of armed
+ */
+ weak_wrmsr_fence();
+ if (!lazy_tscddl->armed || tsc < lazy_tscddl->armed)
+ wrmsrl(MSR_IA32_TSC_DEADLINE, tsc);
+
+ return 0;
+}
+
static int lapic_timer_shutdown(struct clock_event_device *evt)
{
unsigned int v;
@@ -639,7 +662,12 @@ static void setup_APIC_timer(void)
levt->name = "lapic-deadline";
levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC |
CLOCK_EVT_FEAT_DUMMY);
- levt->set_next_event = lapic_next_deadline;
+ if (kvm_para_available() &&
+ kvm_para_has_feature(KVM_FEATURE_LAZY_TSCDEADLINE)) {
+ levt->set_next_event = kvm_lapic_next_deadline;
+ } else {
+ levt->set_next_event = lapic_next_deadline;
+ }
clockevents_config_and_register(levt,
tsc_khz * (1000 / TSC_DIVISOR),
0xF, ~0UL);
--
2.7.4