Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when high read lantency detected

From: Paul E. McKenney
Date: Thu Dec 22 2022 - 22:44:23 EST


On Fri, Dec 23, 2022 at 10:09:04AM +0800, Feng Tang wrote:
> On Thu, Dec 22, 2022 at 01:42:19PM -0800, Paul E. McKenney wrote:
> [...]
> > > >
> > > > > > Also I've run the patch on a Alderlake system, with a fine acpi pm_timer
> > > > > > and a fake broken pm_timer, and they both works without errors.
> > > > >
> > > > > Thank you! Did it correctly identify the fake broken pm_timer as being
> > > > > broken? If so, may I have your Tested-by?
> > > >
> > > > On that Alderlake system, HPET will be disabled by kernel, and I
> > > > manually increased the tsc frequency about 1/256 to make pm_timer
> > > > look to have 1/256 deviation. And got dmesg like:
> > > >
> > > > [ 2.738554] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'acpi_pm' as unstable because the skew is too large:
> > > > [ 2.738558] clocksource: 'tsc' wd_nsec: 275956624 wd_now: 13aab38d0d wd_last: 1382c1144d mask: ffffffffffffffff
> > > > [ 2.738564] clocksource: 'acpi_pm' cs_nsec: 277034651 cs_now: 731575 cs_last: 63f3cb mask: ffffff
> > > > [ 2.738568] clocksource: 'tsc' (not 'acpi_pm') is current clocksource.
> > > >
> > > > The deviation is indeed about 1/256. And pm_timer won't be shown in /sys/:
> > > >
> > > > /sys/devices/system/clocksource/clocksource0/available_clocksource:tsc
> > > > /sys/devices/system/clocksource/clocksource0/current_clocksource:tsc
> > > >
> > > > So feel free to add:
> > > >
> > > > Tested-by: Feng Tang <feng.tang@xxxxxxxxx>
> > >
> > > Thank you very much! I will apply this on my next rebase.
> >
> > But first, here is a prototype of the limited-time clocksource watchdog.
> > Thoughts?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit dfbf67806c4c7f2bdd79cdefe86a2bea6e7afcab
> > Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > Date: Thu Dec 22 13:21:47 2022 -0800
> >
> > clocksource: Permit limited-duration clocksource watchdogging
> >
> > Some systems want regular clocksource checking, but their workloads
> > cannot tolerate the overhead of full-time clocksource checking.
> > Therefore, add a CLOCKSOURCE_WATCHDOG_DURATION Kconfig option and a
> > clocksource.watchdog_duration kernel-boot parameter that limits the
> > clocksource watchdog to the specified number of minutes past boot.
> > Values of zero disable checking, and a value of -1 restores the
> > traditional behavior of always checking.
> >
> > This does change behavior compared to older kernels, but recent kernels
> > disable the clocksource watchdog completely in the common case where the
> > TSC is judged to be trustworthy. This change in behavior is therefore
> > not a real issue.
>
> Yes, this changes the general semantics. Last year, I've posted a
> patch to limit the watchdog to run for 10 minutes, and at that time
> Thomas mentioned one of his machine may show tsc issue after running
> for one day depending on work load [1].
>
> As the intention is to validate HPET/PMTIMER, which are not as
> delicate as TSC, maybe we can add a per-clocksource verify-period
> field, and only set it for HPET/PMTIMER?
>
> [1]. https://lore.kernel.org/lkml/875z286xtk.fsf@xxxxxxxxxxxxxxxxxxxxxxx/

Got it.

The workloads I am closest to are OK with the clocksource watchdog
running indefinitely, but thus far the skew is visible very early.
But broken hardware can do whatever it wants whenever it wants. I could
meet Thomas's request by making the default be indefinite, and allowing
whoever cares to make it finite. Or maybe the fact that the TSC is not
marked unstable makes a difference.

Thoughts?

Thanx, Paul

> Thanks,
> Feng
>
> > Link: https://lore.kernel.org/all/a82092f5-abc8-584f-b2ba-f06c82ffbe7d@xxxxxxxxxx/
> > Reported-by: Waiman Long <longman@xxxxxxxxxx>
> > Reported-by: Feng Tang <feng.tang@xxxxxxxxx>
> > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> >
> > diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> > index bae8f11070bef..2676e011673d5 100644
> > --- a/kernel/time/Kconfig
> > +++ b/kernel/time/Kconfig
> > @@ -209,5 +209,17 @@ config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
> > per million. If the clocksource is good enough for NTP,
> > it is good enough for the clocksource watchdog!
> >
> > +config CLOCKSOURCE_WATCHDOG_DURATION
> > + int "Default time to run clocksource watchdog (in minutes)"
> > + depends on CLOCKSOURCE_WATCHDOG
> > + range -1 1440
> > + default 10
> > + help
> > + Specify the default number of minutes that the clocksource
> > + watchdog should run after being started. Specify -1 to run
> > + indefinitely or zero to not run at all. This value may be
> > + overridden using the clocksource.watchdog_duration kernel
> > + boot parameter.
> > +
> > endmenu
> > endif
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index b153fce8c0e4b..c920c6c22e0fb 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -213,6 +213,9 @@ module_param(max_cswd_read_retries, ulong, 0644);
> > EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> > static int verify_n_cpus = 8;
> > module_param(verify_n_cpus, int, 0644);
> > +static int watchdog_duration = CONFIG_CLOCKSOURCE_WATCHDOG_DURATION;
> > +module_param(watchdog_duration, int, 0444);
> > +static unsigned long watchdog_end_jiffies;
> >
> > enum wd_read_status {
> > WD_READ_SUCCESS,
> > @@ -549,7 +552,9 @@ static void clocksource_watchdog(struct timer_list *unused)
> > * Arm timer if not already pending: could race with concurrent
> > * pair clocksource_stop_watchdog() clocksource_start_watchdog().
> > */
> > - if (!timer_pending(&watchdog_timer)) {
> > + if (!timer_pending(&watchdog_timer) &&
> > + (watchdog_duration < 0 ||
> > + (watchdog_duration >= 0 && time_before(jiffies, watchdog_end_jiffies)))) {
> > watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
> > add_timer_on(&watchdog_timer, next_cpu);
> > }
> > @@ -559,8 +564,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> >
> > static inline void clocksource_start_watchdog(void)
> > {
> > - if (watchdog_running || !watchdog || list_empty(&watchdog_list))
> > + if (watchdog_running || !watchdog || list_empty(&watchdog_list) || !watchdog_duration)
> > return;
> > + if (watchdog_duration > 0)
> > + watchdog_end_jiffies = jiffies + watchdog_duration * 60 * HZ;
> > timer_setup(&watchdog_timer, clocksource_watchdog, 0);
> > watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
> > add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));