Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically

From: Dongli Zhang
Date: Mon Oct 02 2023 - 21:33:29 EST


Hi Sean,

A quick question ...

On 10/2/23 17:53, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
>> On Mon, 2023-10-02 at 09:37 -0700, Sean Christopherson wrote:
>>> On Mon, Oct 02, 2023, David Woodhouse wrote:
>>>> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
>>>>>
>>>>> 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
>>>>>
>>>>> 2. The raw monotonic (tsc_clocksource) uses a different mult/shift/equation.
>>>>>
>>>>
>>>> That just seems wrong. I don't mean that you're incorrect; it seems
>>>> *morally* wrong.
>>>>
>>>> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
>>>> a *different* mult/shift/equation (your #1) to convert TSC ticks to
>>>> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2)?
>>>>
>>>> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
>>>> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
>>>>
>>>> Fix that, and the whole problem goes away, doesn't it?
>>>>
>>>> What am I missing here, that means we can't do that?
>>>
>>> I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
>>> ABI between KVM and KVM guests.
>>>
>>> Like many of the older bits of KVM, my guess is that KVM's behavior is the product
>>> of making things kinda sorta work with old hardware, i.e. was probably the least
>>> awful solution in the days before constant TSCs, but is completely nonsensical on
>>> modern hardware.
>>
>> I still don't understand. The ABI and its math are fine. The ABI is just
>> "at time X the TSC was Y, and the TSC frequency is Z"
>>
>> I understand why on older hardware, those values needed to *change*
>> occasionally when TSC stupidity happened.
>>
>> But on newer hardware, surely we can set them precisely *once* when the
>> VM starts, and never ever have to change them again? Theoretically not
>> even when we pause the VM, kexec into a new kernel, and resume the VM!
>>
>> But we *are* having to change it, because apparently
>> CLOCK_MONOTONIC_RAW is doing something *other* than incrementing at
>> precisely the frequency of the known and constant TSC.
>>
>> But *why* is CLOCK_MONOTONIC_RAW doing that? I thought that the whole
>> point of CLOCK_MONOTONIC_RAW was to be consistent and not adjusted by
>> NTP etc.? Shouldn't it run at precisely the same rate as the kvmclock,
>> with no skew at all?
>
> IIUC, the issue is that the paravirt clock ends up mixing time domains, i.e. is
> a weird bastardization of the host's monotonic raw clock and the paravirt clock.
>
> Despite a name that suggests otherwise (to me at least), __pvclock_read_cycles()
> counts "cycles" in nanoseconds, not TSC ticks.
>
> u64 __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, u64 tsc)
> {
> 	u64 delta = tsc - src->tsc_timestamp;
> 	u64 offset = pvclock_scale_delta(delta, src->tsc_to_system_mul,
> 					 src->tsc_shift);
> 	return src->system_time + offset;
> }
>
> In the above, "offset" is computed in the kvmclock domain, whereas system_time
> comes from the host's CLOCK_MONOTONIC_RAW domain by way of master_kernel_ns.
> The goofy math is much more obvious in __get_kvmclock(), i.e. KVM's host-side
> retrieval of the guest's view of kvmclock:
>
> 	hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>
> The two domains use the same "clock" (constant TSC), but different math to compute
> nanoseconds from a given TSC value. For decently large TSC values, this results
> in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.
>
> When KVM decides to refresh the masterclock, e.g. vCPU hotplug in Dongli's case,
> it resets the base time, a.k.a. system_time. I.e. KVM takes all of the time that
> was accumulated in the kvmclock domain and recomputes it in the CLOCK_MONOTONIC_RAW
> domain. The more time that passes between refreshes, the bigger the time jump
> from the guest's perspective.
>
> E.g. IIUC, your proposed patch to use a single RDTSC[*] eliminates the drift by
> undoing the CLOCK_MONOTONIC_RAW crap using the same TSC value on both the "add"
> and the "subtract", but the underlying train wreck of mixing time domains is
> still there.
>
> Without a constant TSC, deferring the reference time to the host's computation
> makes sense (or at least, is less silly), because the effective TSC would be per
> physical CPU, whereas the reference time is per VM.
>
> [*] https://lore.kernel.org/all/ee446c823002dc92c8ea525f21d00a9f5d27de59.camel@infradead.org
>
>> And if CLOCK_MONOTONIC_RAW is not what I thought it was... do we really
>> have to keep resetting the kvmclock to it at all? On modern hardware
>> can't the kvmclock be defined by the TSC alone?
>
> I think there is still use for synchronizing with the host's view of time, e.g.
> to deal with lost time across host suspend+resume.
>
> So I don't think we can completely sever KVM's paravirt clocks from host time,
> at least not without harming use cases that rely on the host's view to keep
> accurate time. And honestly at that point, the right answer would be to stop
> advertising paravirt clocks entirely.
>
> But I do think we can address the issues that Dongli and David are observing
> where guest time drifts even though the host kernel's base time hasn't changed.
> If I've pieced everything together correctly, the drift can be eliminated simply
> by using the paravirt clock algorithm when converting the delta from the raw TSC
> to nanoseconds.
>
> This is *very* lightly tested, as in it compiles and doesn't explode, but that's
> about all I've tested.
>
> ---
> arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++++---------------
> 1 file changed, 42 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..3ba7edfca47c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2417,6 +2417,9 @@ static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
> static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
> #endif
>
> +static u32 host_tsc_to_system_mul;
> +static s8 host_tsc_shift;
> +
> static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
> static unsigned long max_tsc_khz;
>
> @@ -2812,27 +2815,18 @@ static u64 read_tsc(void)
>  static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
>  			  int *mode)
>  {
> -	u64 tsc_pg_val;
> -	long v;
> +	u64 ns, v;
>
>  	switch (clock->vclock_mode) {
>  	case VDSO_CLOCKMODE_HVCLOCK:
> -		if (hv_read_tsc_page_tsc(hv_get_tsc_page(),
> -					 tsc_timestamp, &tsc_pg_val)) {
> -			/* TSC page valid */
> +		if (hv_read_tsc_page_tsc(hv_get_tsc_page(), tsc_timestamp, &v))
>  			*mode = VDSO_CLOCKMODE_HVCLOCK;
> -			v = (tsc_pg_val - clock->cycle_last) &
> -				clock->mask;
> -		} else {
> -			/* TSC page invalid */
> +		else
>  			*mode = VDSO_CLOCKMODE_NONE;
> -		}
>  		break;
>  	case VDSO_CLOCKMODE_TSC:
>  		*mode = VDSO_CLOCKMODE_TSC;
> -		*tsc_timestamp = read_tsc();
> -		v = (*tsc_timestamp - clock->cycle_last) &
> -			clock->mask;
> +		v = *tsc_timestamp = read_tsc();
>  		break;
>  	default:
>  		*mode = VDSO_CLOCKMODE_NONE;
> @@ -2840,8 +2834,36 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
>
>  	if (*mode == VDSO_CLOCKMODE_NONE)
>  		*tsc_timestamp = v = 0;
> +	else
> +		v = (v - clock->cycle_last) & clock->mask;
>
> -	return v * clock->mult;
> +	ns = clock->base_cycles;
> +
> +	/*
> +	 * When the clock source is a raw, constant TSC, do the conversion to
> +	 * nanoseconds using the paravirt clock math so that the delta in ns is
> +	 * consistent regardless of whether the delta is converted in the host
> +	 * or the guest.
> +	 *
> +	 * The base for paravirt clocks is the kernel's base time in ns, plus
> +	 * the delta since the last sync. E.g. if a masterclock update occurs,
> +	 * KVM will shift some amount of delta from the guest to the host.
> +	 * Conversions from TSC to ns for the host's CLOCK_MONOTONIC_RAW and
> +	 * paravirt clocks aren't equivalent, and so shifting the delta can
> +	 * cause time to jump from the guest's view of the paravirt clock.
> +	 * This only works for a constant TSC, otherwise the calculation would
> +	 * only be valid for the current physical CPU, whereas the base of the
> +	 * clock must be valid for all vCPUs in the VM.
> +	 */
> +	if (static_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +	    *mode == VDSO_CLOCKMODE_TSC && clock == &pvclock_gtod_data.raw_clock) {
> +		ns >>= clock->shift;
> +		ns += pvclock_scale_delta(v, host_tsc_to_system_mul, host_tsc_shift);
> +	} else {
> +		ns += v * clock->mult;
> +		ns >>= clock->shift;
> +	}
> +	return ns;
>  }
>
>  static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
> @@ -2853,9 +2875,7 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
>
>  	do {
>  		seq = read_seqcount_begin(&gtod->seq);
> -		ns = gtod->raw_clock.base_cycles;
> -		ns += vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
> -		ns >>= gtod->raw_clock.shift;
> +		ns = vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
>  		ns += ktime_to_ns(ktime_add(gtod->raw_clock.offset, gtod->offs_boot));
>  	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>  	*t = ns;
> @@ -2873,9 +2893,7 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
>  	do {
>  		seq = read_seqcount_begin(&gtod->seq);
>  		ts->tv_sec = gtod->wall_time_sec;
> -		ns = gtod->clock.base_cycles;
> -		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> -		ns >>= gtod->clock.shift;
> +		ns = vgettsc(&gtod->clock, tsc_timestamp, &mode);
>  	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>
>  	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
> @@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void)
>  	if (ret != 0)
>  		return ret;
>
> +	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> +		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
> +				   &host_tsc_shift, &host_tsc_to_system_mul);

I agree that we should use the kvmclock math to calculate the ns elapsed when
updating the master clock.

Would you take TSC scaling into consideration?

Since host_tsc_shift and host_tsc_to_system_mul are pre-computed from the host
TSC frequency, what happens when a VM uses a different (scaled) TSC frequency?
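
For example (IIUC, kvm_guest_time_update() derives the guest's
tsc_to_system_mul/tsc_shift from the guest's virtual TSC frequency, while
host_tsc_to_system_mul above comes from the host tsc_khz), the two rounded
conversions generally won't agree exactly once the frequencies differ. Here is
a stand-alone sketch with made-up frequencies and a deliberately simplified
mult derivation (shift fixed at 32, not the real kvm_get_time_scale()), just to
illustrate the concern:

/*
 * Illustrative only: with hardware TSC scaling, the guest's kvmclock
 * mult/shift pair is derived from the guest frequency while the new
 * host_tsc_to_system_mul/host_tsc_shift are derived from the host
 * frequency.  Even for the same wall-clock interval, the two rounded
 * conversions typically disagree slightly.
 */
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC	1000000000ULL

/*
 * round(2^32 * NSEC_PER_SEC / hz): ns per tick as a 32-bit fraction.
 * Assumes hz >= NSEC_PER_SEC so the fraction fits in 32 bits.
 */
static uint32_t mult_for_hz(uint64_t hz)
{
	return (uint32_t)((((unsigned __int128)NSEC_PER_SEC << 32) + hz / 2) / hz);
}

static uint64_t ticks_to_ns(uint64_t ticks, uint32_t mult)
{
	return (uint64_t)(((unsigned __int128)ticks * mult) >> 32);
}

int main(void)
{
	const uint64_t host_hz = 2995200000ULL;		/* host TSC: 2995.2 MHz */
	const uint64_t guest_hz = 2500000000ULL;	/* guest scaled to 2.5 GHz */
	const uint64_t secs = 3600;			/* one hour of wall time */

	/* host-side delta uses host ticks + host-derived mult */
	uint64_t host_ns = ticks_to_ns(host_hz * secs, mult_for_hz(host_hz));
	/* guest-side delta uses scaled guest ticks + guest-derived mult */
	uint64_t guest_ns = ticks_to_ns(guest_hz * secs, mult_for_hz(guest_hz));

	printf("host-side ns = %llu, guest-side ns = %llu, diff = %lld\n",
	       (unsigned long long)host_ns, (unsigned long long)guest_ns,
	       (long long)host_ns - (long long)guest_ns);
	return 0;
}

In other words, would a VM with a scaled TSC frequency still see a (smaller)
jump, unless the conversion uses the per-VM mult/shift instead of the host's?
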

Thank you very much!

Dongli Zhang


> +
>  	local_tsc = rdtsc();
>  	stable = !kvm_check_tsc_unstable();
>  	list_for_each_entry(kvm, &vm_list, vm_list) {
>
> base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d