Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

From: David Woodhouse
Date: Wed Nov 08 2023 - 14:14:54 EST


On Wed, 2023-11-08 at 10:22 -0800, Dongli Zhang wrote:
>
> Thank you very much for the explanation.
>
> I understand you may use different methods to obtain the 'expire' under
> different cases.
>
> Maybe add some comments in the KVM code of xen timer emulation? E.g.:
>
> - When the TSC is reliable, follow the standard/protocol that xen timer is
> per-vCPU pvclock based: that is, to always scale host_tsc with kvm_read_l1_tsc().
>
> - However, sometimes TSC is not reliable. Use the legacy method get_kvmclock_ns().

After the patch we're discussing here, kvm_xen_start_timer() is *more*
comment than code. I think it covers both of the points you mention
above, and more.


static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
				bool linux_wa)
{
	uint64_t guest_now;
	int64_t kernel_now, delta;

	/*
	 * The guest provides the requested timeout in absolute nanoseconds
	 * of the KVM clock - as *it* sees it, based on the scaled TSC and
	 * the pvclock information provided by KVM.
	 *
	 * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
	 * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
	 * difference won't matter much as there is no cumulative effect.
	 *
	 * Calculate the time for some arbitrary point in time around "now"
	 * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
	 * delta between the kvmclock "now" value and the guest's requested
	 * timeout, apply the "Linux workaround" described below, and add
	 * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
	 * the absolute CLOCK_MONOTONIC time at which the timer should
	 * fire.
	 */
	if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
	    static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
		uint64_t host_tsc, guest_tsc;

		if (!IS_ENABLED(CONFIG_64BIT) ||
		    !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
			/*
			 * Don't fall back to get_kvmclock_ns() because it's
			 * broken; it has a systemic error in its results
			 * because it scales directly from host TSC to
			 * nanoseconds, and doesn't scale first to guest TSC
			 * and *then* to nanoseconds as the guest does.
			 *
			 * There is a small error introduced here because time
			 * continues to elapse between the ktime_get() and the
			 * subsequent rdtsc(). But not the systemic drift due
			 * to get_kvmclock_ns().
			 */
			kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
			host_tsc = rdtsc();
		}

		/* Calculate the guest kvmclock as the guest would do it. */
		guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock,
						  guest_tsc);
	} else {
		/*
		 * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
		 *
		 * Also if the guest PV clock hasn't been set up yet, as is
		 * likely to be the case during migration when the vCPU has
		 * not been run yet. It would be possible to calculate the
		 * scaling factors properly in that case but there's not much
		 * point in doing so. The get_kvmclock_ns() drift accumulates
		 * over time, so it's OK to use it at startup. Besides, on
		 * migration there's going to be a little bit of skew in the
		 * precise moment at which timers fire anyway. Often they'll
		 * be in the "past" by the time the VM is running again after
		 * migration.
		 */
		guest_now = get_kvmclock_ns(vcpu->kvm);
		kernel_now = ktime_get();
	}

	delta = guest_abs - guest_now;

	/*
	 * Xen has a 'Linux workaround' in do_set_timer_op() which checks
	 * for negative absolute timeout values (caused by integer
	 * overflow), and for values about 13 days in the future (2^50ns)
	 * which would be caused by jiffies overflow. For those cases, it
	 * sets the timeout 100ms in the future (not *too* soon, since if
	 * a guest really did set a long timeout on purpose we don't want
	 * to keep churning CPU time by waking it up).
	 */
	if (linux_wa) {
		if (unlikely((int64_t)guest_abs < 0 ||
			     (delta > 0 && (uint32_t)(delta >> 50) != 0))) {
			delta = 100 * NSEC_PER_MSEC;
			guest_abs = guest_now + delta;
		}
	}

	/*
	 * Avoid races with the old timer firing. Checking timer_expires
	 * to avoid calling hrtimer_cancel() will only have false positives
	 * so is fine.
	 */
	if (vcpu->arch.xen.timer_expires)
		hrtimer_cancel(&vcpu->arch.xen.timer);

	atomic_set(&vcpu->arch.xen.timer_pending, 0);
	vcpu->arch.xen.timer_expires = guest_abs;

	if (delta <= 0) {
		xen_timer_callback(&vcpu->arch.xen.timer);
	} else {
		hrtimer_start(&vcpu->arch.xen.timer,
			      ktime_add_ns(kernel_now, delta),
			      HRTIMER_MODE_ABS_HARD);
	}
}
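
For the "Linux workaround" clamp, the arithmetic can be exercised in isolation. This is just a toy sketch of that one check, not kernel code: linux_wa_clamp() and SKETCH_NSEC_PER_MSEC are local stand-ins for illustration.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_MSEC 1000000LL

/*
 * Standalone sketch of the clamp above: a negative absolute timeout
 * (integer overflow in the guest), or a delta more than 2^50 ns
 * (~13 days) in the future (jiffies overflow), is replaced with a
 * timeout 100ms in the future.
 */
static int64_t linux_wa_clamp(uint64_t guest_abs, uint64_t guest_now)
{
	int64_t delta = guest_abs - guest_now;

	if ((int64_t)guest_abs < 0 ||
	    (delta > 0 && (uint32_t)(delta >> 50) != 0))
		delta = 100 * SKETCH_NSEC_PER_MSEC;

	return delta;
}
```

A sane timeout passes through unchanged; a "negative" absolute value or one more than 2^50 ns out comes back as the 100ms fallback.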



> This may help developers understand the standard/protocol used by xen timer. The
> core idea will be: the implementation is trying to following the xen timer
> nanoseconds definition (per-vCPU pvclock), and it may use other legacy solution
> under special case, in order to improve the accuracy.

This isn't special to the Xen timers. We really are just using the
kvmclock for this. It's the same thing.

> TBH, I never think about what the definition of nanosecond is in xen timer (even
> I used to and I am still working on some xen issue).

You never had to think about it before because it was never quite so
catastrophically broken as it is under KVM. Nanoseconds were
nanoseconds, and you didn't have the problem we have here, where we
scale and calculate them in three *different* ways and periodically
clamp the guest-visible one to a fourth. :)
