Re: [PATCH v2] KVM: x86/xen: improve accuracy of Xen timers

From: Dongli Zhang
Date: Tue Nov 07 2023 - 18:08:19 EST


Hi David,

On 11/7/23 00:17, David Woodhouse wrote:
> On Mon, 2023-11-06 at 17:44 -0800, Dongli Zhang wrote:
>>> +       if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
>>> +           static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
>>
>> If there any reason to use both vcpu->kvm->arch.use_master_clock and
>> X86_FEATURE_CONSTANT_TSC?
>
> Er, paranoia? I'll recheck.
>
>> I think even __get_kvmclock() would not require both cases at the same time?
>>
>>  3071         if (ka->use_master_clock &&
>>  3072             (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
>>
>
> But it does. That requires ka->use_master_clock (line 3071) AND that we
> know the current CPU's TSC frequency (line 3072).
>
> My code insists on the CONSTANT_TSC form of "knowing the current CPU's
> TSC frequency" because even with a get_cpu(), it's not clear the guest
> *was* running on this vCPU when it did its calculations. So I don't
> want to go anywhere near the !CONSTANT_TSC case; it can use the
> fallback.
>
>
>>> +       } else {
>>> +               /*
>>> +                * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
>>> +                *
>>> +                * Also if the guest PV clock hasn't been set up yet, as is
>>> +                * likely to be the case during migration when the vCPU has
>>> +                * not been run yet. It would be possible to calculate the
>>> +                * scaling factors properly in that case but there's not much
>>> +                * point in doing so. The get_kvmclock_ns() drift accumulates
>>> +                * over time, so it's OK to use it at startup. Besides, on
>>> +                * migration there's going to be a little bit of skew in the
>>> +                * precise moment at which timers fire anyway. Often they'll
>>> +                * be in the "past" by the time the VM is running again after
>>> +                * migration.
>>> +                */
>>> +               guest_now = get_kvmclock_ns(vcpu->kvm);
>>> +               kernel_now = ktime_get();
>>
>> 1. Can I assume the issue is still there if we fall into the "else" case? That
>> is, the increasing inaccuracy as the VM has been up for longer and longer time?
>>
>> If that is the case, which may be better?
>>
>> (1) get_kvmclock_ns(), or
>> (2) always get_kvmclock_base_ns() + ka->kvmclock_offset, when pvclock is not
>> enabled, regardless whether master clock is used. At least, the inaccurary is
>> not going to increase over guest time?
>
> No, those are both wrong, and drifting further away over time. They are
> each *differently* wrong, which is why periodically clamping (1) to (2)
> is also broken, as previously discussed. I know you've got a patch to
> do that clamping more *often* which would slightly reduce the pain
> because the kvmclock wouldn't jump backwards so *far* each time... but
> it's still wrong to do so at all (in either direction).
>
>>
>> 2. I see 3 scenarios here:
>>
>> (1) vcpu->arch.hv_clock.version and master clock is used.
>>
>> In this case, the bugfix looks good.
>>
>> (2) The master clock is used. However, pv clock is not enabled.
>>
>> In this case, the bug is not resolved? ... even the master clock is used.
>
> Under Xen the PV clock is always enabled. It's in the vcpu_info
> structure which is required for any kind of event channel setup.
>
>>
>> (3) The master clock is not used.
>>
>> We fall into get_kvmclock_base_ns() + ka->kvmclock_offset. The behavior is not
>> changed. This looks good.
>>
>>
>> Just from my own point: as this patch involves relatively complex changes, I
>> would suggest resolve the issue, but not use a temporary solution :)
>>
>
> This is the conversation I had with Paul on Tuesday, when he wanted me
> to fix up this "(3) / behaviour is not changed" case. And yes, I argued
> that we *don't* want a temporary solution for this case. Because yes:
>
>> (I see you mentioned that you will be back with get_kvmclock_ns())
>
> We need to fix get_kvmclock_ns() anyway. The systemic drift *and* the
> fact that we periodically clamp it to a different clock and make it
> jump. I was working on the former and have something half-done but was
> preempted by the realisation that the QEMU soft freeze is today, and I
> needed to flush my QEMU patch queue.
>
> But even once we fix get_kvmclock_ns(), *this* patch stands. Because it
> *also* addresses the "now" problem, where we get the time by one clock
> ... and then some time passes ... and we get the time by another clock,
> and subtract one from the other as if they were the *same* time.
>
> Using kvm_get_monotonic_and_clockread() gives us a single TSC read
> corresponding to the CLOCK_MONOTONIC time, from which we can calculate
> the kvmclock time. We just *happen* to calculate it correctly here,
> unlike anywhere else in KVM.
>
>> Based on your bug fix, I see the below cases:
>>
>> If master clock is not used:
>>     get_kvmclock_base_ns() + ka->kvmclock_offset
>>
>> If master clock is used:
>>     If pvclock is enabled:
>>         use the &vcpu->arch.hv_clock to get current guest time
>>     Else
>>         create a temporary hv_clock, based on masterclock.
>
> I don't want to do that last 'else' clause yet, because that feels like
> a temporary workaround. It should be enough to call get_kvmclock_ns(),
> once we fix it.
>
>
>

Thank you very much for the detailed explanation.

I agree it is important to resolve the "now" problem. I guess the KVM lapic
deadline timer has the "now" problem as well.


I just notice my question missed a key prerequisite:

Would you mind helping explain the time domain of the "oneshot.timeout_abs_ns"?

While it is the absolute nanosecond value at the VM side, on which time domain
it is based?

1. Is oneshot.timeout_abs_ns based on the xen pvclock (freq=NSEC_PER_SEC)?

2. Is oneshot.timeout_abs_ns based on tsc from VM side?

3. Is oneshot.timeout_abs_ns based on monotonic/raw clock at VM side?

4. Or it is based on wallclock?

I think the OS does not have a concept of nanoseconds. It is derived from a
clocksource.



If it is based on pvclock, is it based on the pvclock from a specific vCPU, as
both pvclock and timer are per-vCPU.


E.g., according to the KVM lapic deadline timer, all values are based on (1) the
tsc value, (2)on the current vCPU.


1949 static void start_sw_tscdeadline(struct kvm_lapic *apic)
1950 {
1951 struct kvm_timer *ktimer = &apic->lapic_timer;
1952 u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
1953 u64 ns = 0;
1954 ktime_t expire;
1955 struct kvm_vcpu *vcpu = apic->vcpu;
1956 unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
1957 unsigned long flags;
1958 ktime_t now;
1959
1960 if (unlikely(!tscdeadline || !this_tsc_khz))
1961 return;
1962
1963 local_irq_save(flags);
1964
1965 now = ktime_get();
1966 guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
1967
1968 ns = (tscdeadline - guest_tsc) * 1000000ULL;
1969 do_div(ns, this_tsc_khz);


Sorry if I make the question very confusing. The core question is: where and
from which clocksource the abs nanosecond value is from? What will happen if the
Xen VM uses HPET as clocksource, while xen timer as clock event?

Regardless the clocksource, KVM VM may always use current vCPU's tsc in the
lapic deadline timer.

Thank you very much!

Dongli Zhang