Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically

From: Dongli Zhang
Date: Fri Sep 29 2023 - 16:16:14 EST


Hi Sean,

On 9/28/23 09:18, Sean Christopherson wrote:
> On Tue, Sep 26, 2023, Dongli Zhang wrote:
>> Hi Joe,
>>
>> On 9/26/23 17:29, Joe Jin wrote:
>>> On 9/26/23 4:06 PM, Dongli Zhang wrote:
>>>> This is to minimize the kvmclock drift during CPU hotplug (or when the
>>>> master clock and pvclock_vcpu_time_info are updated).
>
> Updated by who?

This is about vCPU hotplug, e.g., when we run "device_add
host-x86_64-cpu,id=core4,socket-id=0,core-id=4,thread-id=0" at QEMU side.

When we add a new vCPU to KVM, the
kvm_synchronize_tsc()-->kvm_track_tsc_matching() triggers
KVM_REQ_MASTERCLOCK_UPDATE.

When the new vCPU is onlined for the first time at VM side, the handler of
KVM_REQ_MASTERCLOCK_UPDATE (that is, kvm_update_masterclock()) updates the
master clock (e.g., master_kernel_ns and master_cycle_now, based on the host raw
monotonic).

The kvm_update_masterclock() finally triggers KVM_REQ_CLOCK_UPDATE so that the
master_kernel_ns and the master_cycle_now are propagated to:

- pvclock_vcpu_time_info->system_time
- pvclock_vcpu_time_info->tsc_timestamp

That is ...

- vcpu->hv_clock.system_time
- vcpu->hv_clock.tsc_timestamp

>
>>>> Since kvmclock and raw monotonic (clocksource_tsc) use different
>>>> equation/mult/shift to convert the tsc to nanosecond, there may be clock
>>>> drift issue during CPU hotplug (when the master clock is updated).
>
> Based on #4, I assume you mean "vCPU hotplug from the host", but from this and
> the above it's not clear if this means "vCPU hotplug from the host", "pCPU hotplug
> in the host", or "CPU hotplug in the guest".

It is about vCPU hotplug from the host (e.g., QEMU), although the
KVM_REQ_MASTERCLOCK_UPDATE handler (kvm_update_masterclock()) is triggered when
the new vCPU is onlined (usually via udev) at VM side for the first time.

1. QEMU adds new vCPU to KVM VM
2. VM side uses udev or manual echo command to sysfs to online the vCPU

>
>>>> 1. The guest boots and all vcpus have the same 'pvclock_vcpu_time_info'
>>>> (suppose the master clock is used).
>>>>
>>>> 2. Since the master clock is never updated, the periodic kvmclock_sync_work
>>>> does not update the values in 'pvclock_vcpu_time_info'.
>>>>
>>>> 3. Suppose a very long period has passed (e.g., 30-day).
>>>>
>>>> 4. The user adds another vcpu. Both master clock and
>>>> 'pvclock_vcpu_time_info' are updated, based on the raw monotonic.
>
> So why can't KVM simply force a KVM_REQ_MASTERCLOCK_UPDATE request when a vCPU
> is added? I'm missing why this needs a persistent, periodic refresh.

Sorry for making the commit message confusing.

There is always a KVM_REQ_MASTERCLOCK_UPDATE request when a vCPU is added.
However, the problem is: only the vCPU hotplug triggers
KVM_REQ_MASTERCLOCK_UPDATE (suppose without live migration). That is, generally
(e.g., without migration), there will be no master clock update if we do not do
vCPU hot-add during long period of time.

We want more frequent KVM_REQ_MASTERCLOCK_UPDATE.

This is because:

1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.

2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.

3. As a result, given the same rdtsc, kvmclock and raw monotonic may return
different results (this is expected because they have different
mult/shift/equation).

4. However, the base in kvmclock calculation (tsc_timestamp and system_time)
are derived from raw monotonic clock (master clock)


When tsc_timestamp and system_time are updated:

tsc_diff = tsc_timestamp_new - tsc_timestamp_old
system_time_new = system_time_old + (incremental from raw clock source) <--- (1)

However, from kvmclock, it expects:

system_time_new = system_time_old + kvmclock_equation(tsc_diff) <--- (2)


There is diff between (1) and (2). That will be the reason of kvmclock drift
when we add a new vCPU.

Indeed, the drift is between:

(3) incremental from raw clock source (that is, tsc_equation(tsc_diff)), and

(4) kvmclock_equation(tsc_diff)

The less frequent that master clock is updated, the larger the tsc_diff. As a
result, the larger the drift.


We would like to update the master clock more frequently to reduce the tsc_diff.

>
>>>> @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>>>> static bool __read_mostly kvmclock_periodic_sync = true;
>>>> module_param(kvmclock_periodic_sync, bool, S_IRUGO);
>>>>
>>>> +unsigned int __read_mostly masterclock_sync_period;
>>>> +module_param(masterclock_sync_period, uint, 0444);
>>>
>>> Can the mode be 0644 and allow it be changed at runtime?
>>
>> It can be RW.
>>
>> So far I just copy from kvmclock_periodic_sync as most code are from the
>> mechanism of kvmclock_periodic_sync.
>>
>> static bool __read_mostly kvmclock_periodic_sync = true;
>> module_param(kvmclock_periodic_sync, bool, S_IRUGO);
>
> Unless there's a very good reason for making it writable, I vote to keep it RO
> to simplify the code.

Unless there's a very good reason for making it writable, I vote to keep it RO
to simplify the code.


I will keep it as readable. Thank you very much!

>
>>>> /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
>>>> static u32 __read_mostly tsc_tolerance_ppm = 250;
>>>> module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
>>>> @@ -3298,6 +3301,31 @@ static void kvmclock_sync_fn(struct work_struct *work)
>>>> KVMCLOCK_SYNC_PERIOD);
>>>> }
>>>>
>>>> +static void masterclock_sync_fn(struct work_struct *work)
>>>> +{
>>>> + unsigned long i;
>>>> + struct delayed_work *dwork = to_delayed_work(work);
>>>> + struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
>>>> + masterclock_sync_work);
>>>> + struct kvm *kvm = container_of(ka, struct kvm, arch);
>>>> + struct kvm_vcpu *vcpu;
>>>> +
>>>> + if (!masterclock_sync_period)
>
> This function should never be called if masterclock_sync_period=0. The param
> is RO and so kvm_arch_vcpu_postcreate() shouldn't create the work in the first
> place.

This function should never be called if masterclock_sync_period=0. The param
is RO and so kvm_arch_vcpu_postcreate() shouldn't create the work in the first
place.



Thank you very much for pointing out that. I just copied the code from
kvmclock_sync_fn() (although I have noticed that :) )

I think I may need to send a cleanup patch to remove line 3296-3297 from
existing code as well.

>
>>>> + return;
>>>> +
>>>> + kvm_for_each_vcpu(i, vcpu, kvm) {
>>>> + /*
>>>> + * It is not required to kick the vcpu because it is not
>>>> + * expected to update the master clock immediately.
>>>> + */
>
> This comment needs to explain *why* it's ok for vCPUs to lazily handle the
> masterclock update. Saying "it is not expected" doesn't help understand who/what
> expects anything, or why.

I will update the comment as:

The objective to update the master clock is to reduce (but not to avoid) the
clock drift when there is long period of time between two master clock updates.
It is not expected to update immediately. It is fine to wait until next vCPU entry.


Please let me know if any clarification is needed. I used the below patch to
help diagnose the drift issue.

@@ -3068,6 +3110,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
u64 tsc_timestamp, host_tsc;
u8 pvclock_flags;
bool use_master_clock;
+ struct pvclock_vcpu_time_info old_hv_clock;
+ u64 tsc_raw, tsc, old_ns, new_ns, diff;
+ bool backward;
+
+ memcpy(&old_hv_clock, &vcpu->hv_clock, sizeof(old_hv_clock));

kernel_ns = 0;
host_tsc = 0;
@@ -3144,6 +3191,25 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)

vcpu->hv_clock.flags = pvclock_flags;

+ tsc_raw = rdtsc();
+ tsc = kvm_read_l1_tsc(v, tsc_raw);
+ old_ns = __pvclock_read_cycles(&old_hv_clock, tsc);
+ new_ns = __pvclock_read_cycles(&vcpu->hv_clock, tsc);
+ if (old_ns > new_ns) {
+ backward = true;
+ diff = old_ns - new_ns;
+ } else {
+ backward = false;
+ diff = new_ns - old_ns;
+ }
+ pr_alert("backward=%d, diff=%llu, old_ns=%llu, new_ns=%llu\n",
+ backward, diff, old_ns, new_ns);


Thank you very much!

Dongli Zhang

>
>>>> + kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>>>> + }
>>>> +
>>>> + schedule_delayed_work(&ka->masterclock_sync_work,
>>>> + masterclock_sync_period * HZ);
>>>> +}