Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

From: Dongli Zhang
Date: Tue Nov 07 2023 - 20:43:50 EST


Hi David,

On 11/7/23 15:24, David Woodhouse wrote:
> On Tue, 2023-11-07 at 15:07 -0800, Dongli Zhang wrote:
>> Thank you very much for the detailed explanation.
>>
>> I agree it is important to resolve the "now" problem. I guess the KVM lapic
>> deadline timer has the "now" problem as well.
>
> I think so. And quite gratuitously so, since it just does:
>
> now = ktime_get();
> guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>
>
> Couldn't that trivially be changed to kvm_get_monotonic_and_clockread()?

The core idea is to always capture the pair of (tsc, ns) at exactly the same
time point.

I have no idea how much accuracy it can improve, considering the extra costs to
inject the timer interrupt into the vCPU.

>
> Thankfully, it's defined in the time domain of the guest TSC, not the
> kvmclock, so it doesn't suffer the same drift issue as the Xen timer.
>
>> I just notice my question missed a key prerequisite:
>>
>> Would you mind helping explain the time domain of the "oneshot.timeout_abs_ns"?
>>
>> While it is the absolute nanosecond value at the VM side, on which time domain
>> it is based?
>
> It's the kvmclock. Xen offers as Xen PV clock to its guests using
> *precisely* the same pvclock structure as KVM does.
>
>
>> 1. Is oneshot.timeout_abs_ns based on the xen pvclock (freq=NSEC_PER_SEC)?
>>
>> 2. Is oneshot.timeout_abs_ns based on tsc from VM side?
>>
>> 3. Is oneshot.timeout_abs_ns based on monotonic/raw clock at VM side?
>>
>> 4. Or it is based on wallclock?
>>
>> I think the OS does not have a concept of nanoseconds. It is derived from a
>> clocksource.
>
> It's the kvmclock.

Thank you very much!

Both xen clock and kvm clock are pvclock based on the same equations.

>
> The guest derives it from the guest TSC using the pvclock information
> (mul/shift/offset) that KVM provides to the guest.
>
> The kvm_setup_guest_pvclock() function is potentially called *three*
> times from kvm_guest_time_update(). Once for the KVM pv time MSR, once
> for the pvclock structure in the Xen vcpu_info, and finally for the
> pvclock structure which Xen makes available to userspace for vDSO
> timekeeping.
>
>> If it is based on pvclock, is it based on the pvclock from a specific vCPU, as
>> both pvclock and timer are per-vCPU.
>
> Yes, it is per-vCPU. Although in the sane case the TSCs on all vCPUs
> will match and the mul/shift/offset provided by KVM won't actually
> differ. Even in the insane case where guest TSCs are out of sync,
> surely the pvclock information will differ only in order to ensure that
> the *result* in nanoseconds does not?
>
> I conveniently ducked this question in my patch by only supporting the
> CONSTANT_TSC case, and not the case where we happen to know the
> (potentially different) TSC frequencies on all the different pCPUs and
> vCPUs.

This is also my question that why to support only the CONSTANT_TSC case.

For the lapic timer case:

The timer is always calculated based on the *current* vCPU's tsc virtualization,
regardless CONSTANT_TSC or not.

For the xen timer case:

Why not always calculate the expire based on the *current* vCPU's time
virtualization? That is, why not always use the current vCPU's hv_clock,
regardless CONSTANT_TSC/masteclock?

That is: kvm lapic method with kvm_get_monotonic_and_clockread().

>
>
>>
>> E.g., according to the KVM lapic deadline timer, all values are based on (1) the
>> tsc value, (2)on the current vCPU.
>>
>>
>> 1949 static void start_sw_tscdeadline(struct kvm_lapic *apic)
>> 1950 {
>> 1951         struct kvm_timer *ktimer = &apic->lapic_timer;
>> 1952         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
>> 1953         u64 ns = 0;
>> 1954         ktime_t expire;
>> 1955         struct kvm_vcpu *vcpu = apic->vcpu;
>> 1956         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
>> 1957         unsigned long flags;
>> 1958         ktime_t now;
>> 1959
>> 1960         if (unlikely(!tscdeadline || !this_tsc_khz))
>> 1961                 return;
>> 1962
>> 1963         local_irq_save(flags);
>> 1964
>> 1965         now = ktime_get();
>> 1966         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>> 1967
>> 1968         ns = (tscdeadline - guest_tsc) * 1000000ULL;
>> 1969         do_div(ns, this_tsc_khz);
>>
>>
>> Sorry if I make the question very confusing. The core question is: where and
>> from which clocksource the abs nanosecond value is from? What will happen if the
>> Xen VM uses HPET as clocksource, while xen timer as clock event?
>
> If the guest uses HPET as clocksource and Xen timer as clockevents,
> then keeping itself in sync is the *guest's* problem. The Xen timer is
> defined in terms of nanoseconds since guest start, as provided in the
> pvclock information described above. Hope that helps!
>

The "in terms of nanoseconds since guest start" refers to *one* global value.
Should we use wallclock when we are referring to a global value shared by all vCPUs?


Based on the following piece of code, I do not think we may assume all vCPUs
have the same pvclock at the same time point: line 104-108, when
PVCLOCK_TSC_STABLE_BIT is not set.


67 static __always_inline
68 u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
69 {
70 unsigned version;
71 u64 ret;
72 u64 last;
73 u8 flags;
74
75 do {
76 version = pvclock_read_begin(src);
77 ret = __pvclock_read_cycles(src, rdtsc_ordered());
78 flags = src->flags;
79 } while (pvclock_read_retry(src, version));
... ...
104 last = raw_atomic64_read(&last_value);
105 do {
106 if (ret <= last)
107 return last;
108 } while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret));
109
110 return ret;
111 }


That's why I appreciate a definition of the abs nanoseconds used by the xen
timer (e.g., derived from pvclock). If it is per-vCPU, we may not use it for a
global "in terms of nanoseconds since guest start", when PVCLOCK_TSC_STABLE_BIT
is not set.


Thank you very much!

Dongli Zhang