Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically

From: Dongli Zhang
Date: Wed Oct 04 2023 - 15:15:02 EST


Hi Sean,

On 10/4/23 11:06 AM, Sean Christopherson wrote:
> On Wed, Oct 04, 2023, David Woodhouse wrote:
>> On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
>>>> Can't we ensure that the kvmclock uses the *same* algorithm,
>>>> precisely, as CLOCK_MONOTONIC_RAW?
>>>
>>> Yes?  At least for sane hardware, after much staring, I think it's possible.
>>>
>>> It's tricky because the two algorithms are weirdly different, the PV clock algorithm
>>> is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
>>> at us for suggesting that we try to shove the pvclock algorithm into the kernel.
>>>
>>> The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
>>>
>>> Compile tested only, but I believe this math is correct.  And I'm guessing we'd
>>> want some safeguards against overflow, e.g. due to a multiplier that is too big.
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 6573c89c35a9..ae9275c3d580 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>>>                                             v->arch.l1_tsc_scaling_ratio);
>>>  
>>>         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
>>> -               kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
>>> -                                  &vcpu->hv_clock.tsc_shift,
>>> -                                  &vcpu->hv_clock.tsc_to_system_mul);
>>> +               u32 shift, mult;
>>> +
>>> +               clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
>>> +
>>> +               if (shift <= 32) {
>>> +                       vcpu->hv_clock.tsc_shift = 0;
>>> +                       vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
>>> +               } else {
>>> +                       kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
>>> +                                          &vcpu->hv_clock.tsc_shift,
>>> +                                          &vcpu->hv_clock.tsc_to_system_mul);
>>> +               }
>>> +
>>>                 vcpu->hw_tsc_khz = tgt_tsc_khz;
>>>                 kvm_xen_update_tsc_info(v);
>>>         }
>>>
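To convince myself that the shift <= 32 conversion above is exact, here is a
quick userspace sketch (the helper names are mine, this is not kernel code):
the pvclock ABI shifts the TSC delta by tsc_shift and then multiplies and
shifts right by a fixed 32, so a clocksource (mult, shift) pair with
shift <= 32 can be folded into tsc_shift = 0 and
tsc_to_system_mul = mult << (32 - shift):

```c
#include <stdint.h>

/* Kernel-style clocksource conversion: ns = (delta * mult) >> shift. */
static uint64_t clocksource_ns(uint64_t delta, uint32_t mult, uint32_t shift)
{
	return (delta * (uint64_t)mult) >> shift;
}

/*
 * pvclock ABI conversion: shift the delta by tsc_shift, multiply, then
 * shift right by a hardcoded 32.
 */
static uint64_t pvclock_ns(uint64_t delta, uint32_t sys_mul, int8_t tsc_shift)
{
	if (tsc_shift < 0)
		delta >>= -tsc_shift;
	else
		delta <<= tsc_shift;
	return (delta * (uint64_t)sys_mul) >> 32;
}
```

With mult = 1655736523 and shift = 32 (David's 2593992 kHz case) the
conversion degenerates to tsc_to_system_mul = mult and tsc_shift = 0; the
real hazard is the widened multiplier overflowing u32 when shift < 32, which
is why the overflow safeguard / fallback path is still needed.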
>>
>> I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
>> it got mult=1655736523, shift=32 and took the 'happy' path instead of
>> falling back.
>>
>> It still drifts about the same though, using the same test as before:
>> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock
>>
>>
>> I was going to facetiously suggest that perhaps the kvmclock should
>> have leap nanoseconds... but then realised that that's basically what
>> Dongli's patch is *doing*. Maybe we just need to *recognise* that,
>
> Yeah, I suspect trying to get kvmclock to always precisely align with the kernel's
> monotonic raw clock is a fool's errand.
>
>> so rather than having a user-configured period for the update, KVM could
>> calculate the frequency for the updates based on the rate at which the clocks
>> would otherwise drift, and a maximum delta? Not my favourite option, but
>> perhaps better than nothing?
>
> Holy moly, the existing code for the periodic syncs/updates is a mess. If I'm
> reading the code correctly, commits
>
> 0061d53daf26 ("KVM: x86: limit difference between kvmclock updates")
> 7e44e4495a39 ("x86: kvm: rate-limit global clock updates")
> 332967a3eac0 ("x86: kvm: introduce periodic global clock updates")
>
> splattered together an impressively inefficient update mechanism.
>
> On the first vCPU creation, KVM schedules kvmclock_sync_fn() with a hardcoded
> period of 300 * HZ jiffies, i.e. every 300 seconds.
>
> if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
> schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> KVMCLOCK_SYNC_PERIOD);
>
> That handler does two things: schedule "delayed" work kvmclock_update_fn() to
> be executed immediately, and reschedule kvmclock_sync_fn() another 300 seconds out.
> kvmclock_sync_fn() then kicks *every* vCPU in the VM, i.e. KVM kicks every vCPU
> to sync kvmclock once every 300 seconds.
>
> If we're going to kick every vCPU, then we might as well do a masterclock update,
> because the extra cost of synchronizing the masterclock is likely in the noise
> compared to kicking every vCPU. There's also zero reason to do the work in vCPU
> context.
>
> And because that's not enough, on pCPU migration or if the TSC is unstable,
> kvm_arch_vcpu_load() requests KVM_REQ_GLOBAL_CLOCK_UPDATE, which schedules
> kvmclock_update_fn() with a delay of 100ms. The large delay is to play nice with
> unstable TSCs. But if KVM is periodically doing clock updates on all vCPUs,
> scheduling another update with a *longer* delay is silly.

We may need to add the above explanation as a comment at each place where
KVM_REQ_GLOBAL_CLOCK_UPDATE is replaced with KVM_REQ_CLOCK_UPDATE in the patch?

That would help readers understand why KVM_REQ_CLOCK_UPDATE alone is sometimes
enough.
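For example, something along these lines at the kvm_write_system_time() call
site (the wording is only a sketch of what such a comment could say):

```c
	/*
	 * The periodic kvmclock_sync_work already kicks every vCPU, so a
	 * KVM_REQ_CLOCK_UPDATE on this vCPU alone is enough here; the
	 * other vCPUs will be refreshed by the next sync.
	 */
	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
```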

>
> The really, really stupid part of all is that the periodic syncs happen even if
> kvmclock isn't exposed to the guest. *sigh*
>
> So rather than add yet another periodic work function, I think we should clean up
> the mess we have, fix the whole "leapseconds" mess with the masterclock, and then
> tune the frequency (if necessary).
>
> Something like the below is what I'm thinking. Once the dust settles, I'd like
> to do dynamically enable/disable kvmclock_sync_work based on whether or not the
> VM actually has vCPU's with a pvclock, but that's definitely an enhancement that
> can go on top.
>
> Does this look sane, or am I missing something?
>
> ---
> arch/x86/include/asm/kvm_host.h | 3 +-
> arch/x86/kvm/x86.c | 53 +++++++++++----------------------
> 2 files changed, 19 insertions(+), 37 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 34a64527654c..d108452fc301 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -98,7 +98,7 @@
> KVM_ARCH_REQ_FLAGS(14, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_SCAN_IOAPIC \
> KVM_ARCH_REQ_FLAGS(15, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> -#define KVM_REQ_GLOBAL_CLOCK_UPDATE KVM_ARCH_REQ(16)
> +/* AVAILABLE BIT!!!! KVM_ARCH_REQ(16) */
> #define KVM_REQ_APIC_PAGE_RELOAD \
> KVM_ARCH_REQ_FLAGS(17, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_HV_CRASH KVM_ARCH_REQ(18)
> @@ -1336,7 +1336,6 @@ struct kvm_arch {
> bool use_master_clock;
> u64 master_kernel_ns;
> u64 master_cycle_now;
> - struct delayed_work kvmclock_update_work;
> struct delayed_work kvmclock_sync_work;
>
> struct kvm_xen_hvm_config xen_hvm_config;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..5d35724f1963 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2367,7 +2367,7 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
> }
>
> vcpu->arch.time = system_time;
> - kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
> + kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

As mentioned above, we may need a comment here to explain why a
KVM_REQ_CLOCK_UPDATE on only this vCPU is enough.

>
> /* we verify if the enable bit is set... */
> if (system_time & 1)
> @@ -3257,30 +3257,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>
> #define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100)
>
> -static void kvmclock_update_fn(struct work_struct *work)
> -{
> - unsigned long i;
> - struct delayed_work *dwork = to_delayed_work(work);
> - struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
> - kvmclock_update_work);
> - struct kvm *kvm = container_of(ka, struct kvm, arch);
> - struct kvm_vcpu *vcpu;
> -
> - kvm_for_each_vcpu(i, vcpu, kvm) {
> - kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> - kvm_vcpu_kick(vcpu);
> - }
> -}
> -
> -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> -{
> - struct kvm *kvm = v->kvm;
> -
> - kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> - schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> - KVMCLOCK_UPDATE_DELAY);
> -}
> -
> #define KVMCLOCK_SYNC_PERIOD (300 * HZ)

While David mentioned a "maximum delta", how about turning the above into a
module param with a default of 300 * HZ (i.e., a 300-second period)?

BTW, a 300-second period should be enough for the vCPU hotplug case, unless
people prefer 1 hour or 1 day.

>
> static void kvmclock_sync_fn(struct work_struct *work)
> @@ -3290,12 +3266,14 @@ static void kvmclock_sync_fn(struct work_struct *work)
> kvmclock_sync_work);
> struct kvm *kvm = container_of(ka, struct kvm, arch);
>
> - if (!kvmclock_periodic_sync)
> - return;
> + if (ka->use_master_clock)
> + kvm_update_masterclock(kvm);

Based on the source code, I think it is safe to call kvm_update_masterclock()
here.

We want the masterclock to be updated only once. Raising
KVM_REQ_MASTERCLOCK_UPDATE on every vCPU here would be pointless.

I just want to point out that this runs on the shared workqueue, so workqueue
stall detection may report a false positive (e.g., due to tsc_write_lock
contention, although that lock should not be heavily contended).


Thank you very much!

Dongli Zhang

> + else
> + kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
>
> - schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
> - schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> - KVMCLOCK_SYNC_PERIOD);
> + if (kvmclock_periodic_sync)
> + schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> + KVMCLOCK_SYNC_PERIOD);
> }
>
> /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
> @@ -4845,7 +4823,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> * kvmclock on vcpu->cpu migration
> */
> if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
> - kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
> + kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> if (vcpu->cpu != cpu)
> kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
> vcpu->cpu = cpu;
> @@ -10520,12 +10498,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> __kvm_migrate_timers(vcpu);
> if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
> kvm_update_masterclock(vcpu->kvm);
> - if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
> - kvm_gen_kvmclock_update(vcpu);
> if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
> r = kvm_guest_time_update(vcpu);
> if (unlikely(r))
> goto out;
> +
> + /*
> + * Ensure all other vCPUs synchronize "soon", e.g. so
> + * that all vCPUs recognize NTP corrections and drift
> + * corrections (relative to the kernel's raw clock).
> + */
> + if (!kvmclock_periodic_sync)
> + schedule_delayed_work(&vcpu->kvm->arch.kvmclock_sync_work,
> + KVMCLOCK_UPDATE_DELAY);
> }
> if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
> kvm_mmu_sync_roots(vcpu);
> @@ -12345,7 +12330,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> kvm->arch.hv_root_tdp = INVALID_PAGE;
> #endif
>
> - INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
> INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
>
> kvm_apicv_init(kvm);
> @@ -12387,7 +12371,6 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
> void kvm_arch_sync_events(struct kvm *kvm)
> {
> cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
> - cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
> kvm_free_pit(kvm);
> }
>
>
> base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d