Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically

From: Dongli Zhang
Date: Mon Oct 02 2023 - 13:17:44 EST


Hi Sean and David,

On 10/2/23 09:37, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
>> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
>>>
>>>
>>> We want more frequent KVM_REQ_MASTERCLOCK_UPDATE.
>>>
>>> This is because:
>>>
>>> 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
>>>
>>> 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.
>>>
>>> 3. As a result, given the same rdtsc, kvmclock and raw monotonic may return
>>> different results (this is expected because they have different
>>> mult/shift/equation).
>>>
>>> 4. However, the base in  kvmclock calculation (tsc_timestamp and system_time)
>>> are derived from raw monotonic clock (master clock)
>>
>> That just seems wrong. I don't mean that you're incorrect; it seems
>> *morally* wrong.
>>
>> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
>> a *different* mult/shift/equation (your #1) to convert TSC ticks to
>> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2).
>>
>> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
>> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
>>
>> Fix that, and the whole problem goes away, doesn't it?
>>
>> What am I missing here, that means we can't do that?
>
> I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
> ABI between KVM and KVM guests.
>
> Like many of the older bits of KVM, my guess is that KVM's behavior is the product
> of making things kinda sorta work with old hardware, i.e. was probably the least
> awful solution in the days before constant TSCs, but is completely nonsensical on
> modern hardware.
>
>> Alternatively... with X86_FEATURE_CONSTANT_TSC, why do the sync at all?
>> If KVM wants to decide that the TSC runs at a different frequency to
>> the frequency that the host uses for CLOCK_MONOTONIC_RAW, why can't KVM
>> just *stick* to that?
>
> Yeah, bouncing around guest time when the TSC is constant seems counterproductive.
>
> However, why does any of this matter if the host has a constant TSC? If that's
> the case, a sane setup will expose a constant TSC to the guest and the guest will
> use the TSC instead of kvmclock for the guest clocksource.
>
> Dongli, is this for long-lived "legacy" guests that were created on hosts without
> a constant TSC? If not, then why is kvmclock being used? Or heaven forbid, are
> you running on hardware without a constant TSC? :-)

This is for test guests, and the host has all of below:

tsc, rdtscp, constant_tsc, nonstop_tsc, tsc_deadline_timer, tsc_adjust


A clocksource is used for two things.


1. current_clocksource. Yes, I agree we should always use tsc on modern hardware.

Do we need to update the documentation to always suggest TSC when it is
constant, as I believe many users still prefer pv clock than tsc?

Thanks to tsc ratio scaling, the live migration will not impact tsc.

>From the source code, the rating of kvm-clock is still higher than tsc.

BTW., how about to decrease the rating if guest detects constant tsc?

166 struct clocksource kvm_clock = {
167 .name = "kvm-clock",
168 .read = kvm_clock_get_cycles,
169 .rating = 400,
170 .mask = CLOCKSOURCE_MASK(64),
171 .flags = CLOCK_SOURCE_IS_CONTINUOUS,
172 .enable = kvm_cs_enable,
173 };

1196 static struct clocksource clocksource_tsc = {
1197 .name = "tsc",
1198 .rating = 300,
1199 .read = read_tsc,


2. The sched_clock.

The scheduling is impacted if there is big drift.

Fortunately, according to my general test (without RT requirement), the impact
is trivial unless the two masterclock *updates* are between a super long period.

In the past, the scheduling was impacted a lot when the masterclock was still
based on monotonic (not raw).

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=53fafdbb8b21fa99dfd8376ca056bffde8cafc11


Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
operations (not only sched_clock), e.g., after line 300.

296 void __init kvmclock_init(void)
297 {
298 u8 flags;
299
300 if (!kvm_para_available() || !kvmclock)
301 return;
302
303 if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
304 msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
305 msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
306 } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
307 return;
308 }
309
310 if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
311 kvmclock_setup_percpu, NULL) < 0) {
312 return;
313 }
314
315 pr_info("kvm-clock: Using msrs %x and %x",
316 msr_kvm_system_time, msr_kvm_wall_clock);
317
318 this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
319 kvm_register_clock("primary cpu clock");
320 pvclock_set_pvti_cpu0_va(hv_clock_boot);
321
322 if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
323 pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
324
325 flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
326 kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
327
328 x86_platform.calibrate_tsc = kvm_get_tsc_khz;
329 x86_platform.calibrate_cpu = kvm_get_tsc_khz;
330 x86_platform.get_wallclock = kvm_get_wallclock;
331 x86_platform.set_wallclock = kvm_set_wallclock;
332 #ifdef CONFIG_X86_LOCAL_APIC
333 x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
334 #endif
335 x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
336 x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
337 kvm_get_preset_lpj();


Should I introduce a new param to disable no-kvm-sched-clock only, or to
introduce a new param to allow the selection of sched_clock?


Those may not resolve the issue in another thread. (I guess there is a chance
that to refresh the masterclock may reduce the drift in that case, although
never tried)

https://lore.kernel.org/all/00fba193-238e-49dc-fdc4-0b93f20569ec@xxxxxxxxxx/

Thank you very much!

Dongli Zhang

>
> Not saying we shouldn't sanitize the kvmclock behavior, but knowing the exact
> problematic configuration(s) will help us make a better decision on how to fix
> the mess.