Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically

From: Dongli Zhang
Date: Mon Oct 16 2023 - 13:08:02 EST


Hi David,

On 10/16/23 09:25, David Woodhouse wrote:
> On Mon, 2023-10-16 at 08:47 -0700, Dongli Zhang wrote:
>> Hi David and Sean,
>>
>> On 10/14/23 02:49, David Woodhouse wrote:
>>>
>>>
>>> On 14 October 2023 00:26:45 BST, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>>>>> 2. Suppose the KVM host has been running for long time, and the drift between
>>>>> two domains would be accumulated to super large? (Even it may not introduce
>>>>> anything bad immediately)
>>>>
>>>> That already happens today, e.g. unless the host does vCPU hotplug or is using
>>>> XEN's shared info page, masterclock updates effectively never happen.  And I'm
>>>> not aware of a single bug report of someone complaining that kvmclock has drifted
>>>> from the host clock.  The only bug reports we have are when KVM triggers an update
>>>> and causes time to jump from the guest's perspective.
>>>
>>> I've got reports about the Xen clock going backwards, and also
>>> about it drifting over time w.r.t. the guest's TSC clocksource so
>>> the watchdog in the guest declares its TSC clocksource unstable.
>>
>> I assume you meant Xen on KVM (not Xen guest on Xen hypervisor). According to my
>> brief review of the Xen hypervisor code, it looks like it uses the same algorithm to
>> calculate the clock at hypervisor side, as in the xen guest.
>
> Right. It's *exactly* the same thing. Even the same pvclock ABI in the
> way it's exposed to the guest (in the KVM case via the MSR, in the Xen
> case it's in the vcpu_info or a separate vcpu_time_info set up by Xen
> hypercalls).
>
>> Fortunately, the "tsc=reliable" may disable the watchdog, but I have no idea if
>> it impacts Xen on KVM.
>
> Right. I think Linux as a KVM guest automatically disables the
> watchdog, or at least refuses to use the KVM clock as the watchdog for
> the TSC clocksource?

You may refer to the commit below, which disables the clocksource watchdog for
the TSC when the TSC is considered reliable.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b50db7095fe002fa3e16605546cba66bf1b68a3e

>
> Xen guests, on the other hand, aren't used to the Xen clock being as
> unreliable as the KVM clock is, so they *do* use it as a watchdog for
> the TSC clocksource.
>
>>> I don't understand *why* we update the master clock when we populate
>>> the Xen shared info. Or add a vCPU, for that matter.
>
> Still don't...

I do not know much about Xen-on-KVM. I assume it works in a similar way to
kvmclock.

The question is: why update the master clock when adding a new vCPU (e.g., via
QEMU)?

That is how the source code already behaves and, TBH, I do not know why it was
written that way.


Just to explain the source code, taking QEMU + KVM as an example:

1. QEMU adds a new vCPU to the running guest.

2. QEMU userspace triggers kvm_synchronize_tsc() in KVM via ioctl.

kvm_synchronize_tsc()-->__kvm_synchronize_tsc()-->kvm_track_tsc_matching()

The above tries to sync the TSC, and finally sets KVM_REQ_MASTERCLOCK_UPDATE
pending for the new vCPU.


3. The guest side onlines the new vCPU, either automatically via a udev rule or
manually via sysfs (echo).

4. Once the vCPU is onlined, it starts running on the KVM side.

The KVM side processes KVM_REQ_MASTERCLOCK_UPDATE before entering guest mode.

5. The handler of KVM_REQ_MASTERCLOCK_UPDATE updates the master clock.
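
The steps above can be modelled with a small user-space sketch. It is only a toy
model of the "set a request, process it before guest entry" pattern. All toy_*
names and REQ_MASTERCLOCK_UPDATE are hypothetical stand-ins, not the kernel's
definitions, and the real handler does considerably more than take a snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for KVM_REQ_MASTERCLOCK_UPDATE (not the kernel value). */
#define REQ_MASTERCLOCK_UPDATE ((uint32_t)1u << 0)

/* Toy per-VM state: the (host time, host TSC) pair the master clock is based on. */
struct toy_vm {
	uint64_t master_kernel_ns;
	uint64_t master_cycle_now;
	unsigned int nr_updates;
};

struct toy_vcpu {
	struct toy_vm *vm;
	uint32_t requests;	/* bitmask of pending requests */
};

/* Step 2: the TSC sync triggered by userspace leaves a request pending. */
static void toy_synchronize_tsc(struct toy_vcpu *vcpu)
{
	vcpu->requests |= REQ_MASTERCLOCK_UPDATE;
}

/* Step 5: the handler takes a fresh snapshot for the master clock. */
static void toy_update_masterclock(struct toy_vm *vm, uint64_t host_ns,
				   uint64_t host_tsc)
{
	vm->master_kernel_ns = host_ns;
	vm->master_cycle_now = host_tsc;
	vm->nr_updates++;
}

/*
 * Step 4: pending requests are processed before entering guest mode.
 * Returns true if the master clock was updated on this entry.
 */
static bool toy_vcpu_enter_guest(struct toy_vcpu *vcpu, uint64_t host_ns,
				 uint64_t host_tsc)
{
	assert(vcpu != NULL && vcpu->vm != NULL);

	if (!(vcpu->requests & REQ_MASTERCLOCK_UPDATE))
		return false;

	vcpu->requests &= ~REQ_MASTERCLOCK_UPDATE;
	toy_update_masterclock(vcpu->vm, host_ns, host_tsc);
	return true;
}
```

In this model, a vCPU that never went through toy_synchronize_tsc() enters the
guest without touching the master clock, while the hot-added vCPU causes exactly
one update on its first entry.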

>
>>>>> The idea is to never update master clock, if tsc is stable (and masterclock is
>>>>> already used).
>>>>
>>>> That's another option, but if there are no masterclock updates, then it suffers
>>>> the exact same (theoretical) problem as #2.  And there are real downsides, e.g.
>>>> defining when KVM would synchronize kvmclock with the host clock would be
>>>> significantly harder...
>>>
>>> I thought the definition of such an approach would be that we
>>> *never* resync the kvmclock to anything. It's based purely on the
>>> TSC value when the guest started, and the TSC frequency. The
>>> pvclock we advertise to all vCPUs would be the same, and would
>>> *never* change except on migration.
>>>
>>> (I guess that for consistency we would scale first to the *guest*
>>> TSC and from that to nanoseconds.)
>>>
>>> If userspace does anything which makes that become invalid,
>>> userspace gets to keep both pieces. That includes userspace having
>>> to deal with host suspend like migration, etc.
>>
>> Suppose we are discussing a non-permanent solution, I would suggest:
>>
>> 1. Document something to accept that kvm-clock (or pvclock on KVM, including Xen
>> on KVM) is not good enough in some cases, e.g., vCPU hotplug.
>
> I still don't understand the vCPU hotplug case.
>
> In the case where the TSC is actually sane, why would we need to reset
> the masterclock on vCPU hotplug?
>
> The new vCPU gets its TSC synchronised to the others, and its kvmclock
> parameters (mul/shift/offset based on the guest TSC) can be *precisely*
> the same as the other vCPUs too, can't they? Why reset anything?

While I understand how the source code works, I do not know why it works that way.
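
For reference, the pvclock read that those per-vCPU parameters feed into can be
sketched as below. This is a user-space approximation, not the kernel code:
struct toy_pvclock keeps only the fields needed here (the real ABI structure
also carries version and flags fields), and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Trimmed, hypothetical view of the per-vCPU pvclock parameters. */
struct toy_pvclock {
	uint64_t tsc_timestamp;		/* guest TSC when the snapshot was taken */
	uint64_t system_time;		/* clock value in ns at that snapshot */
	uint32_t tsc_to_system_mul;	/* 32.32 fixed-point multiplier */
	int8_t tsc_shift;		/* pre-shift applied to the TSC delta */
};

/* Scale a TSC delta to nanoseconds: (delta shifted by 'shift') * mul / 2^32. */
static uint64_t toy_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	uint64_t lo, hi;

	assert(shift > -64 && shift < 64);

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/* 64x32-bit multiply followed by >> 32, done in two halves. */
	lo = (uint64_t)(uint32_t)delta * mul;
	hi = (delta >> 32) * mul;
	return (lo >> 32) + hi;
}

/* Clock value in ns for a given guest TSC reading. */
static uint64_t toy_pvclock_read(const struct toy_pvclock *p, uint64_t tsc)
{
	return p->system_time +
	       toy_scale_delta(tsc - p->tsc_timestamp, p->tsc_to_system_mul,
			       p->tsc_shift);
}
```

With identical parameters and synchronized TSCs, two vCPUs compute the same
value for the same TSC reading, which is the point raised in the question above.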

I shared the patch below from my earlier diagnostic kernel; it avoids updating
the master clock if the master clock is already in use and stable.

https://lore.kernel.org/kvm/cf2b22fc-78f5-dfb9-f0e6-5c4059a970a2@xxxxxxxxxx/

>
>> 2. Do not rely on any userspace change, so that the solution can be easier to
>> apply to existing environments running old KVM versions.
>>
>> That is, to limit the change within KVM.
>>
>> 3. The options would be to (1) stop updating masterclock in the ideal scenario
>> (e.g., stable tsc), or to (2) refresh periodically to minimize the drift.
>
> If the host TSC is sane, just *never* update the KVM masterclock. It
> "drifts" w.r.t. the host CLOCK_MONOTONIC_RAW and nobody will ever care.

I think it is one of the two options, although I prefer option 2 to option 1.

1. Do not update the master clock.

2. Refresh the master clock periodically.
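
To make the trade-off concrete, here is a back-of-the-envelope sketch of how far
two clocks diverge when their rates differ by a fixed amount. The helper
toy_drift_ns() and any ppm value passed to it are hypothetical and purely for
illustration; no rate difference was measured in this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Divergence (in ns) accumulated over 'elapsed_ns' between two clocks whose
 * rates differ by 'ppm' parts per million. Hypothetical helper, user space.
 */
static uint64_t toy_drift_ns(uint64_t elapsed_ns, uint32_t ppm)
{
	assert(ppm <= 1000000);

	/* Split the multiplication to avoid overflow for long uptimes. */
	return (elapsed_ns / 1000000) * ppm +
	       (elapsed_ns % 1000000) * ppm / 1000000;
}
```

With option 1 the elapsed time is the whole guest uptime, so the divergence
keeps growing. With option 2 the elapsed time is at most one refresh period, so
the divergence stays bounded, at the cost of a small adjustment visible to the
guest at each refresh. For example, at an assumed 10 ppm, one day accumulates
864000000 ns, while a 300 second refresh period accumulates 3000000 ns.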

>
> The only opt-in we need from userspace for that is to promise that the
> host TSC will never get mangled, isn't it?

Regarding QEMU, I assume you meant either:

(1) -cpu host,+invtsc (at QEMU command line), or
(2) tsc=reliable (at guest kernel command line)

>
> (We probably want to be able to export the pvclock information to
> userspace (in terms of the mul/shift/offset from host TSC to guest TSC
> and then the mul/shift/offset to kvmclock). Userspace may want to make
> things like the PIT/HPET/PMtimer run on that clock.)
>


Thank you very much!

Dongli Zhang