Re: [PATCH 0/2] RFC: Precise TSC migration (summary)

From: Maxim Levitsky
Date: Mon Nov 30 2020 - 08:40:14 EST


This is a summary of a few things that I think are relevant.

Best regards,
Maxim Levitsky

# Random unsynchronized ramblings about the TSC in KVM/Linux

## The KVM's master clock

Under the assumption that

a. Host TSC is synchronized and stable (wasn't marked as unstable).

b. Guest TSC is synchronized:

- When guest starts running, all its vCPUs start from TSC = 0.

- If we hotplug a vCPU, its TSC is set to 0, but 'kvm_synchronize_tsc'
(it used to be called kvm_write_tsc), practically speaking,
just sets the TSC to the same value that the other vCPUs have right now.

Later the Linux guest will try to sync it again, but since it is already
synchronized it won't do anything. Otherwise it uses the IA32_TSC_ADJUST MSR
to adjust it.

- If the guest writes the TSC, we try to detect whether the TSC is still synced.
(We don't handle TSC adjustments done via the IA32_TSC_ADJUST MSR.)

Then the kvmclock is driven by a single (nsecs, tsc) pair, which is used
to update the kvmclock on all vCPUs.

The advantage of this is that no random error can be introduced by calculating
this pair on each CPU.

Another advantage is that, since the pair is vCPU invariant, the guest's
kvmclock implementation can read the clock from userspace.
This is signaled by setting the KVM_CLOCK_TSC_STABLE flag via the kvmclock interface.
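
To illustrate why a single pair removes per-CPU error, here is a minimal sketch. It is not the actual KVM code: the struct and function names are made up, and it converts cycles with a plain kHz division rather than the mul/shift values that the real pvclock structure carries.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot shared by all vCPUs: one (nsecs, tsc) pair. */
struct master_clock {
    uint64_t base_ns;   /* kernel time, in ns, when the pair was taken */
    uint64_t base_tsc;  /* guest TSC value at that same instant */
    uint32_t tsc_khz;   /* guest TSC frequency in kHz */
};

/* Convert a guest TSC reading to nanoseconds using the shared pair. */
static inline uint64_t master_clock_read(const struct master_clock *mc,
                                         uint64_t tsc)
{
    assert(mc->tsc_khz != 0);
    uint64_t delta = tsc - mc->base_tsc;
    /* ns = cycles * 1e6 / kHz; 128-bit math avoids overflow. */
    unsigned __int128 ns = (unsigned __int128)delta * 1000000u / mc->tsc_khz;
    return mc->base_ns + (uint64_t)ns;
}
```

Since every vCPU uses the same base values, any two vCPUs that read the same TSC value compute the same time. With base_ns = 5000, base_tsc = 1000 and a 2 GHz TSC, a reading 2000000 cycles after the base gives 5000 + 1000000 ns.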

## KVM behavior when the host tsc is detected as unstable

* On each 'userspace' VM entry (aka 'vcpu_load'), we set the guest TSC to the
value captured on the last userspace VM exit,
and we schedule a KVM clock update (KVM_REQ_GLOBAL_CLOCK_UPDATE).

* On each KVM clock update, we 'catch up' the TSC to the kernel clock.

* We don't use the master clock.
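
The catch-up step can be pictured roughly like this. This is only a sketch under my own assumptions: the struct, its fields and the function names are hypothetical and do not mirror the real KVM data structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vCPU state; field names are illustrative only. */
struct vcpu_tsc {
    uint64_t start_ns;   /* kernel clock when the guest TSC was anchored */
    uint64_t start_tsc;  /* guest TSC value at that anchor */
    uint32_t tsc_khz;    /* guest TSC frequency in kHz */
};

/* Guest TSC value implied by the kernel clock at kernel_ns. */
static inline uint64_t expected_guest_tsc(const struct vcpu_tsc *v,
                                          uint64_t kernel_ns)
{
    assert(kernel_ns >= v->start_ns);
    unsigned __int128 cycles =
        (unsigned __int128)(kernel_ns - v->start_ns) * v->tsc_khz / 1000000u;
    return v->start_tsc + (uint64_t)cycles;
}

/* Cycles to add to the guest TSC; it is never moved backwards. */
static inline uint64_t tsc_catchup_delta(const struct vcpu_tsc *v,
                                         uint64_t kernel_ns,
                                         uint64_t current_guest_tsc)
{
    uint64_t want = expected_guest_tsc(v, kernel_ns);
    return want > current_guest_tsc ? want - current_guest_tsc : 0;
}
```

With a 1 GHz guest TSC anchored at (0 ns, 0 cycles), a guest TSC of 400 at kernel time 1000 ns is 600 cycles behind and gets bumped by that amount, while a guest TSC that is already ahead is left alone.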

## The TSC 'features' in the Linux kernel

The Linux kernel has, roughly speaking, 4 CPU 'features' that define its treatment of the TSC.
Some of these features are set from CPUID, some are set when certain CPU
models are detected, and some are set when a specific hypervisor is detected.

* X86_FEATURE_TSC_KNOWN_FREQ
This is the most harmless feature. It is set by various hypervisors
(including kvmclock), and for some Intel models, when the TSC frequency can
be obtained via an interface (PV, cpuid, msr, etc.) rather than by measuring it.

* X86_FEATURE_NONSTOP_TSC

On real hardware, this feature is set when CPUID has the 'invtsc' bit set.
It tells the kernel that the TSC doesn't stop in low power idle states.
When it is absent, an attempt to enter a low power idle state (e.g. C2) will mark
the TSC as unstable.

This feature also has a friend called X86_FEATURE_NONSTOP_TSC_S3,
which doesn't do anything that is relevant to KVM.

In a VM, the vCPU is interrupted often, but as long as the host TSC
is stable, the guest TSC should remain stable as well.

(The guest TSC 'keeps on running' when the guest CPU is not
running, which is the same thing as the situation in which
a real CPU is in low power/waiting state)

However, not exposing this bit to the guest doesn't cause much harm, since the
guest usually doesn't use idle states, and thus it never marks the TSC
as unstable due to the lack of this bit.
(The exception to that is the cpu-pm=on case, which is TBD.)
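
For reference, the 'invtsc' bit is CPUID.80000007H:EDX[8]. A small userspace sketch to check it, using the `__get_cpuid` helper from the GCC/Clang `<cpuid.h>` header (the function and macro names below are my own):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID_LEAF_APM 0x80000007u  /* Advanced Power Management leaf */
#define INVTSC_EDX_BIT 8            /* invariant TSC bit within EDX */

static_assert(INVTSC_EDX_BIT < 32, "bit index must fit in EDX");

/* Decode the invtsc bit from a raw EDX value of leaf 0x80000007. */
static inline bool edx_has_invtsc(uint32_t edx)
{
    return (edx >> INVTSC_EDX_BIT) & 1u;
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Query the running CPU; returns false if the leaf is not available. */
static inline bool cpu_has_invtsc(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(CPUID_LEAF_APM, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_has_invtsc(edx);
}
#endif
```

Running `cpu_has_invtsc()` inside a guest shows whether the hypervisor exposed the bit to it.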

* X86_FEATURE_CONSTANT_TSC

On real hardware this bit informs the kernel that the TSC frequency doesn't
change with CPU frequency changes. If it is not set, on the first cpufreq update
the TSC is marked as unstable.
On real hardware this bit is also set by the 'invtsc' CPUID bit, plus it is set
on a few older Intel models which lack it but still have a constant TSC.

In a VM, once again there is no cpufreq driver, thus the lack of this bit
doesn't cause much harm.

However, if a VM is migrated to a host with a different CPU frequency,
and TSC scaling is not supported, then the guest will see a jump in the
frequency.

This is why qemu doesn't enable 'invtsc' by default, and blocks migration
when it is enabled, unless you also explicitly specify the frequency the
guest's TSC should run at. In that case qemu will attempt to scale the host TSC
to this value using hardware means, and I think it will fail the VM start
if that is not possible.
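
Hardware TSC scaling boils down to multiplying the host TSC by a fixed-point ratio. A rough sketch follows; the number of fractional bits is an assumption made for illustration, since real hardware defines its own format, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed number of fractional bits in the ratio (illustrative only). */
#define RATIO_FRAC_BITS 48

/* Fixed-point ratio guest_khz / host_khz.
 * Note: the result overflows 64 bits if guest_khz / host_khz >= 2^16. */
static inline uint64_t tsc_scaling_ratio(uint32_t guest_khz, uint32_t host_khz)
{
    assert(host_khz != 0);
    return (uint64_t)(((unsigned __int128)guest_khz << RATIO_FRAC_BITS)
                      / host_khz);
}

/* Scale a host TSC value (or delta) to the guest's frequency. */
static inline uint64_t scale_tsc(uint64_t host_tsc, uint64_t ratio)
{
    return (uint64_t)(((unsigned __int128)host_tsc * ratio)
                      >> RATIO_FRAC_BITS);
}
```

For a guest configured for 2 GHz running on a 4 GHz host, the ratio is 0.5, so 1000 host cycles become 500 guest cycles. Without such scaling the guest would simply see the new host's frequency after migration.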

* X86_FEATURE_TSC_RELIABLE

This feature is only for virtualization (although I have seen some Atom
specific code piggyback on it), and it makes the guest trust the TSC more
(kind of 'trust us, you can count on the TSC').

It makes the guest do two things.
1. Disable the clocksource watchdog (which is a periodic mechanism,
by which the kernel compares TSC with other clocksources,
and if it sees something fishy, it marks the tsc as unstable)

2. Disable TSC sync on new vCPU hotplug, instead relying on the hypervisor
to sync it.
(IMHO we should set this flag for KVM as well)
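
To make item 1 more concrete, the watchdog check is conceptually something like the sketch below. This is a simplification with a hypothetical function name, and the caller supplies the threshold; it is not the kernel's actual clocksource watchdog code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compare the time elapsed according to the TSC with the time elapsed
 * according to a reference clocksource over the same interval.
 * Returns true if the two disagree by more than threshold_ns.
 */
static inline bool watchdog_tsc_looks_unstable(uint64_t tsc_delta_ns,
                                               uint64_t ref_delta_ns,
                                               uint64_t threshold_ns)
{
    assert(threshold_ns > 0);
    uint64_t skew = tsc_delta_ns > ref_delta_ns
                        ? tsc_delta_ns - ref_delta_ns
                        : ref_delta_ns - tsc_delta_ns;
    return skew > threshold_ns;
}
```

With X86_FEATURE_TSC_RELIABLE set, the guest skips this kind of periodic comparison entirely and just trusts the TSC.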