Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when high read lantency detected

From: Paul E. McKenney
Date: Thu Dec 22 2022 - 18:28:53 EST


On Thu, Dec 22, 2022 at 01:42:19PM -0800, Paul E. McKenney wrote:
> On Thu, Dec 22, 2022 at 10:24:46AM -0800, Paul E. McKenney wrote:
> > On Thu, Dec 22, 2022 at 02:37:24PM +0800, Feng Tang wrote:
> > > On Wed, Dec 21, 2022 at 10:14:29PM -0800, Paul E. McKenney wrote:
> > > > On Thu, Dec 22, 2022 at 02:00:42PM +0800, Feng Tang wrote:
> > > > > On Wed, Dec 21, 2022 at 09:55:15PM -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Dec 21, 2022 at 10:39:53PM -0500, Waiman Long wrote:
>
> [ . . . ]
>
> > > > > > > As I currently understand, you are trying to use TSC as a watchdog to check
> > > > > > > against HPET and PMTMR. I do have 2 questions about this patch.
> > > > > > >
> > > > > > > First of all, why you need to use both HPET and PMTMR? Can you just use one
> > > > > > > of those that are available. Secondly, is it possible to enable this
> > > > > > > time-skew diagnostic for a limit amount of time instead running
> > > > > > > indefinitely? The running of the clocksource watchdog itself will still
> > > > > > > consume a tiny amount of CPU cycles.
> > > > > >
> > > > > > I could certainly do something so that only the first of HPET and PMTMR
> > > > > > is checked. Could you give me a quick run-through of the advantages of
> > > > > > using only one? I would need to explain that in the commit log.
> > > > > >
> > > > > > Would it make sense to have a kernel boot variable giving the number of
> > > > > > minutes for which the watchdog was to run, with a default of zero
> > > > > > meaning "indefinitely"?
> > > > >
> > > > > We've discussed about the "os noise", which customer may really care.
> > > > > IIUC, this patch intends to test if HPET/PMTIMER HW is broken, so how
> > > > > about making it run for a number of minutes the default behavior.
> > > >
> > > > It is also intended to determine if TSC is broken, with NTP drift rates
> > > > used to determine which timer is at fault.
> > > >
> > > > OK, how about a Kconfig option for the number of minutes, set to whatever
> > > > you guys tell me? (Three minutes? Five minutes? Something else?)
> > > > People wanting to run it continuously could then build their kernels
> > > > with that Kconfig option set to zero.
> > >
> > > I don't have specific preference for 5 or 10 minutes, as long as it
> > > is a one time deal :)
> > >
> > > > > Also I've run the patch on a Alderlake system, with a fine acpi pm_timer
> > > > > and a fake broken pm_timer, and they both works without errors.
> > > >
> > > > Thank you! Did it correctly identify the fake broken pm_timer as being
> > > > broken? If so, may I have your Tested-by?
> > >
> > > On that Alderlake system, HPET will be disabled by kernel, and I
> > > manually increased the tsc frequency about 1/256 to make pm_timer
> > > look to have 1/256 deviation. And got dmesg like:
> > >
> > > [ 2.738554] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'acpi_pm' as unstable because the skew is too large:
> > > [ 2.738558] clocksource: 'tsc' wd_nsec: 275956624 wd_now: 13aab38d0d wd_last: 1382c1144d mask: ffffffffffffffff
> > > [ 2.738564] clocksource: 'acpi_pm' cs_nsec: 277034651 cs_now: 731575 cs_last: 63f3cb mask: ffffff
> > > [ 2.738568] clocksource: 'tsc' (not 'acpi_pm') is current clocksource.
> > >
> > > The deviation is indeed about 1/256. And pm_timer won't be shown in /sys/:
> > >
> > > /sys/devices/system/clocksource/clocksource0/available_clocksource:tsc
> > > /sys/devices/system/clocksource/clocksource0/current_clocksource:tsc
> > >
> > > So feel free to add:
> > >
> > > Tested-by: Feng Tang <feng.tang@xxxxxxxxx>
> >
> > Thank you very much! I will apply this on my next rebase.
>
> But first, here is a prototype of the limited-time clocksource watchdog.

And here is the limited-watchdogging patch.

Thoughts?

Thanx, Paul

------------------------------------------------------------------------

commit 375e65d3055f9b379e5fbd449e69752cb69b4e19
Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
Date: Thu Dec 22 15:20:37 2022 -0800

clocksource: Limit the number of watchdogged clocksources

When TSC is deemed trustworthy, the clocksource watchdog will verify
other clocksources against it. @@@ Why limit it? Needed from Waiman.
Maybe overhead and disturbance of additional checks? @@@

Therefore, supply a new tsc_watchdogged kernel boot parameter that
limits the number of clocksources that will be verified against TSC.
This parameter defaults to INT_MAX. A value of zero prevents any
verification.

Link: https://lore.kernel.org/all/a82092f5-abc8-584f-b2ba-f06c82ffbe7d@xxxxxxxxxx/
Reported-by: Waiman Long <longman@xxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
Cc: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
Cc: Feng Tang <feng.tang@xxxxxxxxx>
Cc: Waiman Long <longman@xxxxxxxxxx
Cc: <x86@xxxxxxxxxx>

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 68c597e5909a4..0e304e40c21fa 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6345,6 +6345,12 @@
with CPUID.16h support and partial CPUID.15h support.
Format: <unsigned int>

+ tsc_watchdogged= [X86]
+ Specify the limit on the number of clocksources
+ that will be verified against TSC in cases where
+ the TSC is deemed trustworthy. Defaults to
+ INT_MAX. Specify zero to avoid verification.
+
tsx= [X86] Control Transactional Synchronization
Extensions (TSX) feature in Intel processors that
support TSX control.
diff --git a/arch/x86/include/asm/time.h b/arch/x86/include/asm/time.h
index a53961c64a567..0fa15c4819082 100644
--- a/arch/x86/include/asm/time.h
+++ b/arch/x86/include/asm/time.h
@@ -8,7 +8,7 @@
extern void hpet_time_init(void);
extern void time_init(void);
extern bool pit_timer_init(void);
-extern bool tsc_clocksource_watchdog_disabled(void);
+extern void tsc_clocksource_watchdog_disabled(struct clocksource *csp);

extern struct clock_event_device *global_clock_event;

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index c8eb1ac5125ab..cf28b0abc06bd 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -1091,8 +1091,7 @@ int __init hpet_enable(void)
if (!hpet_counting())
goto out_nohpet;

- if (tsc_clocksource_watchdog_disabled())
- clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
+ tsc_clocksource_watchdog_disabled(&clocksource_hpet);
clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);

if (id & HPET_ID_LEGSUP) {
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 924e877b95f31..6a1def7c02a6e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -53,6 +53,9 @@ static u32 art_to_tsc_denominator;
static u64 art_to_tsc_offset;
struct clocksource *art_related_clocksource;

+static int max_tsc_watchdogged = INT_MAX;
+static atomic_t cur_tsc_watchdogged;
+
struct cyc2ns {
struct cyc2ns_data data[2]; /* 0 + 2*16 = 32 */
seqcount_latch_t seq; /* 32 + 4 = 36 */
@@ -308,6 +311,14 @@ static int __init tsc_setup(char *str)

__setup("tsc=", tsc_setup);

+static int __init tsc_watchdogged_setup(char *str)
+{
+ max_tsc_watchdogged = simple_strtol(str, NULL, 0);
+ return 1;
+}
+
+__setup("tsc_watchdogged=", tsc_watchdogged_setup);
+
#define MAX_RETRIES 5
#define TSC_DEFAULT_THRESHOLD 0x20000

@@ -1186,9 +1197,18 @@ static void __init tsc_disable_clocksource_watchdog(void)
clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
}

-bool tsc_clocksource_watchdog_disabled(void)
+/*
+ * If the TSC is judged trustworthy and the limit on the number of
+ * to-be-watchdogged clocksources has not been exceeded, place the specified
+ * clocksource into must-verify state.
+ */
+void tsc_clocksource_watchdog_disabled(struct clocksource *csp)
{
- return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY);
+ if (clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY ||
+ atomic_inc_return(&cur_tsc_watchdogged) > max_tsc_watchdogged)
+ return;
+ pr_info("clocksource: '%s' will be checked by clocksource watchdog.\n", csp->name);
+ csp->flags |= CLOCK_SOURCE_MUST_VERIFY;
}

static void __init check_system_tsc_reliable(void)
diff --git a/drivers/clocksource/acpi_pm.c b/drivers/clocksource/acpi_pm.c
index 82338773602ca..8562f59ac27e9 100644
--- a/drivers/clocksource/acpi_pm.c
+++ b/drivers/clocksource/acpi_pm.c
@@ -211,8 +211,7 @@ static int __init init_acpi_pm_clocksource(void)
return -ENODEV;
}

- if (tsc_clocksource_watchdog_disabled())
- clocksource_acpi_pm.flags |= CLOCK_SOURCE_MUST_VERIFY;
+ tsc_clocksource_watchdog_disabled(&clocksource_acpi_pm);
return clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
}