Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

From: David Woodhouse
Date: Tue Nov 07 2023 - 18:36:28 EST


On Tue, 2023-11-07 at 15:07 -0800, Dongli Zhang wrote:
> Thank you very much for the detailed explanation.
>
> I agree it is important to resolve the "now" problem. I guess the KVM lapic
> deadline timer has the "now" problem as well.

I think so. And quite gratuitously so, since it just does:

	now = ktime_get();
	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());


Couldn't that trivially be changed to kvm_get_monotonic_and_clockread()?

Thankfully, it's defined in the time domain of the guest TSC, not the
kvmclock, so it doesn't suffer the same drift issue as the Xen timer.
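
To put a number on the "now" problem, here is a minimal standalone sketch in userspace C. It is not KVM code: tsc_deadline_to_ns() and hrtimer_expiry_ns() are hypothetical helpers, and the 3 GHz guest TSC in the note below is an assumed figure. It mirrors the cycles-to-nanoseconds arithmetic in the quoted start_sw_tscdeadline() and adds the result to a host "now" that may have been sampled at a slightly different instant than the TSC.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: nanoseconds until a guest TSC deadline.
 * cycles * 1000000 / kHz == nanoseconds. Like the quoted kernel code,
 * this ignores overflow for very large deltas.
 */
static uint64_t tsc_deadline_to_ns(uint64_t tscdeadline, uint64_t guest_tsc,
                                   uint64_t tsc_khz)
{
    assert(tsc_khz != 0);
    if (tscdeadline <= guest_tsc)
        return 0; /* deadline already passed */
    return (tscdeadline - guest_tsc) * 1000000ULL / tsc_khz;
}

/*
 * Hypothetical helper: absolute host expiry time. The result is only
 * exact if now_ns and guest_tsc describe the same instant.
 */
static uint64_t hrtimer_expiry_ns(uint64_t now_ns, uint64_t tscdeadline,
                                  uint64_t guest_tsc, uint64_t tsc_khz)
{
    return now_ns + tsc_deadline_to_ns(tscdeadline, guest_tsc, tsc_khz);
}
```

With an assumed 3 GHz guest TSC (tsc_khz = 3000000), a deadline 3000000 cycles away is 1 ms. If the TSC is read 3000 cycles (1 us) after "now" was taken, the computed expiry lands 1 us early. A paired read of both clocks removes that skew.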

> I just notice my question missed a key prerequisite:
>
> Would you mind helping explain the time domain of the "oneshot.timeout_abs_ns"?
>
> While it is the absolute nanosecond value at the VM side, on which time domain
> it is based?

It's the kvmclock. Xen offers a Xen PV clock to its guests using
*precisely* the same pvclock structure as KVM does.


> 1. Is oneshot.timeout_abs_ns based on the xen pvclock (freq=NSEC_PER_SEC)?
>
> 2. Is oneshot.timeout_abs_ns based on tsc from VM side?
>
> 3. Is oneshot.timeout_abs_ns based on monotonic/raw clock at VM side?
>
> 4. Or it is based on wallclock?
>
> I think the OS does not have a concept of nanoseconds. It is derived from a
> clocksource.

It's the kvmclock.

The guest derives it from the guest TSC using the pvclock information
(mul/shift/offset) that KVM provides to the guest.
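
As a rough illustration of that derivation, here is a standalone sketch of the conversion. The field names follow Linux's struct pvclock_vcpu_time_info, but struct pvclock_sketch and pvclock_sketch_read_ns() are hypothetical names for this example, and the version/flags handling that a real guest needs for a consistent read is left out.

```c
#include <stdint.h>

/* Simplified subset of the pvclock fields (illustration only). */
struct pvclock_sketch {
    uint64_t tsc_timestamp;     /* guest TSC when the record was written */
    uint64_t system_time;       /* clock value in ns at tsc_timestamp */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;         /* binary pre-scaling of the TSC delta */
};

/* Convert a guest TSC reading into nanoseconds on the pvclock. */
static uint64_t pvclock_sketch_read_ns(const struct pvclock_sketch *pv,
                                       uint64_t guest_tsc)
{
    uint64_t delta = guest_tsc - pv->tsc_timestamp;
    uint64_t hi, lo;

    if (pv->tsc_shift < 0)
        delta >>= -pv->tsc_shift;
    else
        delta <<= pv->tsc_shift;

    /* (delta * mul) >> 32, split in halves to stay within 64 bits. */
    hi = delta >> 32;
    lo = delta & 0xffffffffULL;
    return pv->system_time + hi * pv->tsc_to_system_mul +
           ((lo * pv->tsc_to_system_mul) >> 32);
}
```

For example, a 2 GHz guest TSC could be described by tsc_shift = 0 and tsc_to_system_mul = 0x80000000 (0.5 ns per cycle), so 2000 cycles after tsc_timestamp the clock reads system_time + 1000.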

The kvm_setup_guest_pvclock() function is potentially called *three*
times from kvm_guest_time_update(). Once for the KVM pv time MSR, once
for the pvclock structure in the Xen vcpu_info, and finally for the
pvclock structure which Xen makes available to userspace for vDSO
timekeeping.

> If it is based on pvclock, is it based on the pvclock from a specific vCPU, as
> both pvclock and timer are per-vCPU.

Yes, it is per-vCPU. Although in the sane case the TSCs on all vCPUs
will match and the mul/shift/offset provided by KVM won't actually
differ. Even in the insane case where guest TSCs are out of sync,
surely the pvclock information will differ only in order to ensure that
the *result* in nanoseconds does not?

I conveniently ducked this question in my patch by only supporting the
CONSTANT_TSC case, and not the case where we happen to know the
(potentially different) TSC frequencies on all the different pCPUs and
vCPUs.
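
To illustrate the out-of-sync case, here is a small standalone sketch, assuming the pvclock parameters are indeed chosen to compensate. Two hypothetical vCPUs have guest TSCs that differ by a fixed offset, and their per-vCPU records differ only in tsc_timestamp. struct vcpu_pvclock and vcpu_pvclock_ns() are made-up names, and all the values are invented for the example.

```c
#include <stdint.h>

/* Per-vCPU pvclock parameters (simplified, illustration only). */
struct vcpu_pvclock {
    uint64_t tsc_timestamp;     /* this vCPU's TSC at the update */
    uint64_t system_time;       /* ns at tsc_timestamp */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;         /* binary pre-scaling of the TSC delta */
};

/* Nanoseconds as seen by one vCPU, from its own TSC and its own record. */
static uint64_t vcpu_pvclock_ns(const struct vcpu_pvclock *pv, uint64_t tsc)
{
    uint64_t delta = tsc - pv->tsc_timestamp;

    if (pv->tsc_shift < 0)
        delta >>= -pv->tsc_shift;
    else
        delta <<= pv->tsc_shift;

    /* (delta * mul) >> 32 without needing a 128-bit type. */
    return pv->system_time + (delta >> 32) * pv->tsc_to_system_mul +
           (((delta & 0xffffffffULL) * pv->tsc_to_system_mul) >> 32);
}
```

Suppose vCPU0 has { 1000, 5000, 0x80000000, 0 } and vCPU1, whose TSC runs 500000 cycles ahead, has { 501000, 5000, 0x80000000, 0 }. At the instant vCPU0 reads TSC 3000 and vCPU1 reads 503000, both compute 6000 ns.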


>
> E.g., according to the KVM lapic deadline timer, all values are based on (1) the
> tsc value, (2)on the current vCPU.
>
>
> static void start_sw_tscdeadline(struct kvm_lapic *apic)
> {
>         struct kvm_timer *ktimer = &apic->lapic_timer;
>         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
>         u64 ns = 0;
>         ktime_t expire;
>         struct kvm_vcpu *vcpu = apic->vcpu;
>         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
>         unsigned long flags;
>         ktime_t now;
>
>         if (unlikely(!tscdeadline || !this_tsc_khz))
>                 return;
>
>         local_irq_save(flags);
>
>         now = ktime_get();
>         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>
>         ns = (tscdeadline - guest_tsc) * 1000000ULL;
>         do_div(ns, this_tsc_khz);
>
>
> Sorry if I make the question very confusing. The core question is: where and
> from which clocksource the abs nanosecond value is from? What will happen if the
> Xen VM uses HPET as clocksource, while xen timer as clock event?

If the guest uses HPET as clocksource and Xen timer as clockevents,
then keeping itself in sync is the *guest's* problem. The Xen timer is
defined in terms of nanoseconds since guest start, as provided in the
pvclock information described above. Hope that helps!
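
As a rough sketch of those semantics (an illustration with hypothetical helper names, not the code from the patch): the absolute timeout is compared against the guest's current kvmclock value, and the remaining delta is applied to a host time that is assumed to have been sampled at the same instant.

```c
#include <stdint.h>

/* Signed distance from the guest's current kvmclock to the timeout. */
static int64_t xen_timer_delta_ns(uint64_t timeout_abs_ns,
                                  uint64_t guest_now_ns)
{
    return (int64_t)(timeout_abs_ns - guest_now_ns);
}

/*
 * Host-side expiry for a Xen oneshot timer. host_now_ns and guest_now_ns
 * are assumed to describe the same instant; if they do not, the error
 * carries straight into the expiry time.
 */
static uint64_t xen_timer_host_expiry_ns(uint64_t host_now_ns,
                                         uint64_t timeout_abs_ns,
                                         uint64_t guest_now_ns)
{
    int64_t delta = xen_timer_delta_ns(timeout_abs_ns, guest_now_ns);

    if (delta <= 0)
        return host_now_ns; /* already due: fire immediately */
    return host_now_ns + (uint64_t)delta;
}
```

With a host time of 10000 ns and a guest kvmclock of 6000 ns, a timeout_abs_ns of 7000 yields a host expiry of 11000 ns, while a timeout of 5000 is already due and yields 10000 ns.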

