Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

From: Vitaly Kuznetsov
Date: Thu Oct 19 2023 - 07:56:32 EST


Dongli Zhang <dongli.zhang@xxxxxxxxxx> writes:

> As mentioned in the linux kernel development document, "sched_clock() is
> used for scheduling and timestamping". While there is a default native
> implementation, many paravirtualizations have their own implementations.
>
> About KVM, it uses kvm_sched_clock_read() and there is no way to only
> disable KVM's sched_clock. The "no-kvmclock" may disable all
> paravirtualized kvmclock features.
>
> 94 static inline void kvm_sched_clock_init(bool stable)
> 95 {
> 96 if (!stable)
> 97 clear_sched_clock_stable();
> 98 kvm_sched_clock_offset = kvm_clock_read();
> 99 paravirt_set_sched_clock(kvm_sched_clock_read);
> 100
> 101 pr_info("kvm-clock: using sched offset of %llu cycles",
> 102 kvm_sched_clock_offset);
> 103
> 104 BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
> 105 sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
> 106 }
>
> There is known issue that kvmclock may drift during vCPU hotplug [1].
> Although a temporary fix is available [2], we may need a way to disable pv
> sched_clock. Nowadays, the TSC is more stable and has less performance
> overhead than kvmclock.
>
> This is to propose to introduce a global param to disable pv sched_clock
> for all paravirtualizations.
>
> Please suggest and comment if other options are better:
>
> 1. Global param (this RFC patch).
>
> 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
>
> Indeed I like the 2nd method.
>
> 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
>
> 4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
> all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
> want to keep the pv sched_clock.

Normally, it should be up to the hypervisor to tell the guest which
clock to use, i.e. if TSC is reliable or not. Let me put my question
this way: if TSC on the particular host is good for everything, why
does the hypervisor advertises 'kvmclock' to its guests? If for some
'historical reasons' we can't revoke features we can always introduce a
new PV feature bit saying that TSC is preferred.

1) Global param doesn't sound like a good idea to me: chances are that
people will be setting it on their guest images to workaround problems
on one hypervisor (or, rather, on one public cloud which is too lazy to
fix their hypervisor) while simultaneously creating problems on another.

2) KVM specific parameter can work, but as KVM's sched_clock is the same
as kvmclock, I'm not convinced it actually makes sense to separate the
two. Like if sched_clock is known to be bad but TSC is good, why do we
need to use PV clock at all? Having a parameter for debugging purposes
may be OK though...

3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
(HV_ACCESS_TSC_INVARIANT) and not the architectural
CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the later on
all hypervisors.

4) Personally, I'm not sure that relying on 'TSC is crap' detection is
100% reliable. I can imagine cases when we can't detect that fact that
while synchronized across CPUs and not going backwards, it is, for
example, ticking with an unstable frequency and PV sched clock is
supposed to give the right correction (all of them are rdtsc() based
anyways, aren't they?).

>
> To introduce a param may be easier to backport to old kernel version.
>
> References:
> [1] https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@xxxxxxxxxx/
> [2] https://lore.kernel.org/all/20231018195638.1898375-1-seanjc@xxxxxxxxxx/
> [3] https://lore.kernel.org/all/20231002211651.GA3774@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> Thank you very much for the suggestion!
>
> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
> ---
> arch/x86/include/asm/paravirt.h | 2 +-
> arch/x86/kernel/kvmclock.c | 12 +++++++-----
> arch/x86/kernel/paravirt.c | 18 +++++++++++++++++-
> 3 files changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 6c8ff12140ae..f36edf608b6b 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -24,7 +24,7 @@ u64 dummy_sched_clock(void);
> DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
> DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
>
> -void paravirt_set_sched_clock(u64 (*func)(void));
> +int paravirt_set_sched_clock(u64 (*func)(void));
>
> static __always_inline u64 paravirt_sched_clock(void)
> {
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index fb8f52149be9..0b8bf5677d44 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -93,13 +93,15 @@ static noinstr u64 kvm_sched_clock_read(void)
>
> static inline void kvm_sched_clock_init(bool stable)
> {
> - if (!stable)
> - clear_sched_clock_stable();
> kvm_sched_clock_offset = kvm_clock_read();
> - paravirt_set_sched_clock(kvm_sched_clock_read);
>
> - pr_info("kvm-clock: using sched offset of %llu cycles",
> - kvm_sched_clock_offset);
> + if (!paravirt_set_sched_clock(kvm_sched_clock_read)) {
> + if (!stable)
> + clear_sched_clock_stable();
> +
> + pr_info("kvm-clock: using sched offset of %llu cycles",
> + kvm_sched_clock_offset);
> + }
>
> BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
> sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 97f1436c1a20..2cfef94317b0 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -118,9 +118,25 @@ static u64 native_steal_clock(int cpu)
> DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
> DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>
> -void paravirt_set_sched_clock(u64 (*func)(void))
> +static bool no_pv_sched_clock;
> +
> +static int __init parse_no_pv_sched_clock(char *arg)
> +{
> + no_pv_sched_clock = true;
> + return 0;
> +}
> +early_param("no_pv_sched_clock", parse_no_pv_sched_clock);
> +
> +int paravirt_set_sched_clock(u64 (*func)(void))
> {
> + if (no_pv_sched_clock) {
> + pr_info("sched_clock: not configurable\n");
> + return -EPERM;
> + }
> +
> static_call_update(pv_sched_clock, func);
> +
> + return 0;
> }
>
> /* These are in entry.S */

--
Vitaly