Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

From: Sean Christopherson
Date: Thu Oct 19 2023 - 11:41:10 EST


On Thu, Oct 19, 2023, Vitaly Kuznetsov wrote:
> Dongli Zhang <dongli.zhang@xxxxxxxxxx> writes:
>
> > As mentioned in the linux kernel development document, "sched_clock() is
> > used for scheduling and timestamping". While there is a default native
> > implementation, many paravirtualizations have their own implementations.
> >
> > About KVM, it uses kvm_sched_clock_read() and there is no way to only
> > disable KVM's sched_clock. The "no-kvmclock" may disable all
> > paravirtualized kvmclock features.

...

> > Please suggest and comment if other options are better:
> >
> > 1. Global param (this RFC patch).
> >
> > 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
> >
> > Indeed I like the 2nd method.
> >
> > 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
> >
> > 4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
> > all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
> > want to keep the pv sched_clock.
>
> Normally, it should be up to the hypervisor to tell the guest which
> clock to use, i.e. if TSC is reliable or not. Let me put my question
> this way: if TSC on the particular host is good for everything, why
> does the hypervisor advertises 'kvmclock' to its guests?

I suspect there are two reasons.

1. As is likely the case in our fleet, no one revisited the set of advertised
PV features when defining the VM shapes for a new generation of hardware, or
whoever did the reviews wasn't aware that advertising kvmclock is actually
suboptimal. All the PV clock stuff in KVM is quite labyrinthian, so it's
not hard to imagine it getting overlooked.

2. Legacy VMs. If VMs have been running with a PV clock for years, forcing
them to switch to a new clocksource is high-risk, low-reward.

> If for some 'historical reasons' we can't revoke features we can always
> introduce a new PV feature bit saying that TSC is preferred.
>
> 1) Global param doesn't sound like a good idea to me: chances are that
> people will be setting it on their guest images to workaround problems
> on one hypervisor (or, rather, on one public cloud which is too lazy to
> fix their hypervisor) while simultaneously creating problems on another.
>
> 2) KVM specific parameter can work, but as KVM's sched_clock is the same
> as kvmclock, I'm not convinced it actually makes sense to separate the
> two. Like if sched_clock is known to be bad but TSC is good, why do we
> need to use PV clock at all? Having a parameter for debugging purposes
> may be OK though...
>
> 3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
> (HV_ACCESS_TSC_INVARIANT) and not the architectural
> CPUID.80000007H:EDX[8]. I'm not sure we can blindly trust the later on
> all hypervisors.
>
> 4) Personally, I'm not sure that relying on 'TSC is crap' detection is
> 100% reliable. I can imagine cases when we can't detect that fact that
> while synchronized across CPUs and not going backwards, it is, for
> example, ticking with an unstable frequency and PV sched clock is
> supposed to give the right correction (all of them are rdtsc() based
> anyways, aren't they?).

Yeah, practically speaking, the only thing adding a knob to turn off using PV
clocks for sched_clock will accomplish is creating an even bigger matrix of
combinations that can cause problems, e.g. where guests end up using kvmclock
timekeeping but not scheduling.

The explanation above and the links below fail to capture _the_ key point:
Linux-as-a-guest already prioritizes the TSC over paravirt clocks as the clocksource
when the TSC is constant and nonstop (first spliced blob below).

What I suggested is that if the TSC is chosen over a PV clock as the clocksource,
then we have the kernel also override the sched_clock selection (second spliced
blob below).

That doesn't require the guest admin to opt-in, and doesn't create even more
combinations to support. It also provides for a smoother transition for when
customers inevitably end up creating VMs on hosts that don't advertise kvmclock
(or any PV clock).

> > To introduce a param may be easier to backport to old kernel version.
> >
> > References:
> > [1] https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@xxxxxxxxxx/
> > [2] https://lore.kernel.org/all/20231018195638.1898375-1-seanjc@xxxxxxxxxx/
> > [3] https://lore.kernel.org/all/20231002211651.GA3774@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

On Mon, Oct 2, 2023 at 11:18 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > Do we need to update the documentation to always suggest TSC when it is
> > constant, as I believe many users still prefer pv clock than tsc?
> >
> > Thanks to tsc ratio scaling, the live migration will not impact tsc.
> >
> > >From the source code, the rating of kvm-clock is still higher than tsc.
> >
> > BTW., how about to decrease the rating if guest detects constant tsc?
> >
> > 166 struct clocksource kvm_clock = {
> > 167 .name = "kvm-clock",
> > 168 .read = kvm_clock_get_cycles,
> > 169 .rating = 400,
> > 170 .mask = CLOCKSOURCE_MASK(64),
> > 171 .flags = CLOCK_SOURCE_IS_CONTINUOUS,
> > 172 .enable = kvm_cs_enable,
> > 173 };
> >
> > 1196 static struct clocksource clocksource_tsc = {
> > 1197 .name = "tsc",
> > 1198 .rating = 300,
> > 1199 .read = read_tsc,
>
> That's already done in kvmclock_init().
>
> if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> !check_tsc_unstable())
> kvm_clock.rating = 299;
>
> See also: https://lore.kernel.org/all/ZOjF2DMBgW%2FzVvL3@xxxxxxxxxx
>
> > 2. The sched_clock.
> >
> > The scheduling is impacted if there is big drift.
>
> ...
>
> > Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
> > operations (not only sched_clock), e.g., after line 300.
>
> ...
>
> > Should I introduce a new param to disable no-kvm-sched-clock only, or to
> > introduce a new param to allow the selection of sched_clock?
>
> I don't think we want a KVM-specific knob, because every flavor of paravirt guest
> would need to do the same thing. And unless there's a good reason to use a
> paravirt clock, this really shouldn't be something the guest admin needs to opt
> into using.


On Mon, Oct 2, 2023 at 2:06 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Mon, Oct 02, 2023 at 11:18:50AM -0700, Sean Christopherson wrote:
> > Assuming the desirable thing to do is to use native_sched_clock() in this
> > scenario, do we need a separate rating system, or can we simply tie the
> > sched clock selection to the clocksource selection, e.g. override the
> > paravirt stuff if the TSC clock has higher priority and is chosen?
>
> Yeah, I see no point of another rating system. Just force the thing back
> to native (or don't set it to that other thing).