Re: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against startup values

From: Like Xu
Date: Mon Sep 25 2023 - 03:37:10 EST


Ping for further comments and confirmation from Sean.
Could we trigger a new version for this issue ? Thanks.

On 19/9/2023 7:29 pm, Like Xu wrote:
On 14/9/2023 3:31 pm, David Woodhouse wrote:
On Thu, 2023-09-14 at 11:50 +0800, Like Xu wrote:
On 13/9/2023 10:47 pm, Sean Christopherson wrote:
On Wed, Sep 13, 2023, Like Xu wrote:
I'll wait for a cooling off period to see if the maintainers need me to post v7.

You should have waiting to post v5, let alone v6.  Resurrecting a thread after a
month and not waiting even 7 hours for others to respond is extremely frustrating.

You are right. I don't seem to be keeping up with many of other issues. Sorry
for that.
Wish there were 48 hours in a day.

Back to this issue: for commit message, I'd be more inclined to David's
understanding,

The discussion that Sean and I had should probably be reflected in the
commit message too. To the end of the commit log you used for v6, after
the final 'To that end:…' paragraph, let's add:

  Note that userspace can explicitly request a *synchronization* of the
  TSC by writing zero. For the purpose of this patch, this counts as
  "setting" the TSC. If userspace then subsequently writes an explicit
  non-zero value which happens to be within 1 second of the previous
  value, it will be 'corrected'. For that case, this preserves the prior
  behaviour of KVM (which always applied the 1-second 'correction'
  regardless of user vs. kernel).

@@ -2728,27 +2729,45 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu,
u64 data)
         elapsed = ns - kvm->arch.last_tsc_nsec;

         if (vcpu->arch.virtual_tsc_khz) {
+               /*
+                * Force synchronization when creating or hotplugging a vCPU,
+                * i.e. when the TSC value is '0', to help keep clocks stable.
+                * If this is NOT a hotplug/creation case, skip synchronization
+                * on the first write from userspace so as not to misconstrue
+                * state restoration after live migration as an attempt from
+                * userspace to synchronize.
+                */

You cannot *misconstrue* an attempt from userspace to synchronize. If
userspace writes a zero, it's a sync attempt. If it's non-zero it's a
TSC value to be set. It's not very subtle :)

I think the 1-second slop thing is sufficiently documented in the 'else
if' clause below, so I started writing an alternative 'overall' comment
to go here and found it a bit redundant. So maybe let's just drop this
comment and add one back in the if (data == 0) case...

                 if (data == 0) {
-                       /*
-                        * detection of vcpu initialization -- need to sync
-                        * with other vCPUs. This particularly helps to keep
-                        * kvm_clock stable after CPU hotplug
-                        */


             /*
              * Force synchronization when creating a vCPU, or when
              * userspace explicitly writes a zero value.
              */

                         synchronizing = true;
-               } else {
+               } else if (kvm->arch.user_set_tsc) {
                         u64 tsc_exp = kvm->arch.last_tsc_write +
                                                 nsec_to_cycles(vcpu, elapsed);
                         u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
                         /*
-                        * Special case: TSC write with a small delta (1 second)
-                        * of virtual cycle time against real time is
-                        * interpreted as an attempt to synchronize the CPU.
+                        * Here lies UAPI baggage: when a user-initiated TSC write has
+                        * a small delta (1 second) of virtual cycle time against the
+                        * previously set vCPU, we assume that they were intended to be
+                        * in sync and the delta was only due to the racy nature of the
+                        * legacy API.
+                        *
+                        * This trick falls down when restoring a guest which genuinely
+                        * has been running for less time than the 1 second of imprecision
+                        * which we allow for in the legacy API. In this case, the first
+                        * value written by userspace (on any vCPU) should not be subject
+                        * to this 'correction' to make it sync up with values that only

Missing the word 'come' here too, in '…that only *come* from…',

+                        * from the kernel's default vCPU creation. Make the 1-second slop
+                        * hack only trigger if the user_set_tsc flag is already set.
+                        *
+                        * The correct answer is for the VMM not to use the legacy API.

Maybe we should drop this line, as we don't actually have a sane API
yet that VMMs can use instead.


Thanks for your comments, but not sure if Sean has any more concerns to move forward:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a4def36d5bb..9a7dfef9d32d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1324,6 +1324,7 @@ struct kvm_arch {
     int nr_vcpus_matched_tsc;

     u32 default_tsc_khz;
+    bool user_set_tsc;

     seqcount_raw_spinlock_t pvclock_sc;
     bool use_master_clock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..11fbd2a4a370 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2714,8 +2714,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
     kvm_track_tsc_matching(vcpu);
 }

-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 {
+    u64 data = user_value ? *user_value : 0;
     struct kvm *kvm = vcpu->kvm;
     u64 offset, ns, elapsed;
     unsigned long flags;
@@ -2730,25 +2731,37 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
     if (vcpu->arch.virtual_tsc_khz) {
         if (data == 0) {
             /*
-             * detection of vcpu initialization -- need to sync
-             * with other vCPUs. This particularly helps to keep
-             * kvm_clock stable after CPU hotplug
+             * Force synchronization when creating a vCPU, or when
+             * userspace explicitly writes a zero value.
              */
             synchronizing = true;
-        } else {
+        } else if (kvm->arch.user_set_tsc) {
             u64 tsc_exp = kvm->arch.last_tsc_write +
                         nsec_to_cycles(vcpu, elapsed);
             u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
             /*
-             * Special case: TSC write with a small delta (1 second)
-             * of virtual cycle time against real time is
-             * interpreted as an attempt to synchronize the CPU.
+             * Here lies UAPI baggage: when a user-initiated TSC write has
+             * a small delta (1 second) of virtual cycle time against the
+             * previously set vCPU, we assume that they were intended to be
+             * in sync and the delta was only due to the racy nature of the
+             * legacy API.
+             *
+             * This trick falls down when restoring a guest which genuinely
+             * has been running for less time than the 1 second of imprecision
+             * which we allow for in the legacy API. In this case, the first
+             * value written by userspace (on any vCPU) should not be subject
+             * to this 'correction' to make it sync up with values that only
+             * come from the kernel's default vCPU creation. Make the 1-second
+             * slop hack only trigger if the user_set_tsc flag is already set.
              */
             synchronizing = data < tsc_exp + tsc_hz &&
                     data + tsc_hz > tsc_exp;
         }
     }

+    if (user_value)
+        kvm->arch.user_set_tsc = true;
+
     /*
      * For a reliable TSC, we can match TSC offsets, and for an unstable
      * TSC, we add elapsed time in this computation.  We could let the
@@ -3777,7 +3790,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
         break;
     case MSR_IA32_TSC:
         if (msr_info->host_initiated) {
-            kvm_synchronize_tsc(vcpu, data);
+            kvm_synchronize_tsc(vcpu, &data);
         } else {
             u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
             adjust_tsc_offset_guest(vcpu, adj);
@@ -5536,6 +5549,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
         tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
         ns = get_kvmclock_base_ns();

+        kvm->arch.user_set_tsc = true;
         __kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
         raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);

@@ -11959,7 +11973,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
     if (mutex_lock_killable(&vcpu->mutex))
         return;
     vcpu_load(vcpu);
-    kvm_synchronize_tsc(vcpu, 0);
+    kvm_synchronize_tsc(vcpu, NULL);
     vcpu_put(vcpu);

     /* poll control enabled by default */