Re: [PATCH v2] clocksource: Skip watchdog check for large watchdog intervals

From: Jiri Wiesner
Date: Sat Jan 13 2024 - 06:44:19 EST


On Fri, Jan 12, 2024 at 05:48:22PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 10 2024 at 20:26, Jiri Wiesner wrote:
> > The measured clocksource skew - the absolute difference between cs_nsec
> > and wd_nsec - was 668 microseconds:
> >> cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
> >
> > The kernel (based on 5.14.21) used 200 microseconds for the
> > uncertainty_margin of both the clocksource and watchdog, resulting in a
> > threshold of 400 microseconds. The discrepancy is that the measured
> > clocksource skew was evaluated against a threshold suited for watchdog
> > intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5
> > second.
>
> This really took some time to decode. What you are trying to explain is:
>
> The comparison between the clocksource and the watchdog is not
> working for large readout intervals because the conversion to
> nanoseconds is imprecise. The reason is that the initialization
> values of the shift/mult pairs which are used for conversion are not
> sufficiently accurate and the accumulated inaccuracy causes the
> comparison to exceed the threshold.

The root cause of the bug does not concern the precision of the conversion
to nanoseconds. The shift/mult pair of the TSC can convert diffs as large
as 600 seconds. The HPET is limited to 179.0 seconds on account of being a
32-bit counter. The acpi_pm can convert only 4.7 seconds. With the
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE option enabled, the ranges are
reduced to a half. The example above showed the TSC as the clocksource and
the HPET as a watchdog both of which should be able to convert a diff of
14.5 seconds to nanoseconds with sufficient precision.

I could change the description to:

The kernel (based on 5.14.21) used 200 microseconds for the
uncertainty_margin of both the clocksource and watchdog, resulting in a
threshold of 400 microseconds (the md variable). The root cause of the
issue is that the measured clocksource skew, 668 microseconds, was
evaluated against a threshold (the md variable) which is suited for
watchdog intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is
0.5 second. Both the cs_nsec and the wd_nsec value indicate that the
watchdog interval was circa 14.5 seconds.

The intention of 2e27e793e280 ("clocksource: Reduce clocksource-skew
threshold") was to tighten the threshold for evaluating skew and set the
lower bound for the uncertainty_margin of clocksources to twice
WATCHDOG_MAX_SKEW. Later in c37e85c135ce ("clocksource: Loosen clocksource
watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
125 microseconds to fit the limit of NTP, which is able to use
a clocksource that suffers from up to 500 microseconds of skew per second.
Both the TSC and the HPET use default uncertainty_margin. When the
watchdog interval gets stretched the default uncertainty_margin is no
longer a suitable lower bound for evaluating skew - it imposes a limit
that is stricter than the skew with which NTP can deal. The longer the
watchdog interval is the larger the threshold should be. For evaluating
skew in a watchdog interval of 14.5 seconds, a proportional threshold
should be used, which should be 14500 microseconds (7250 coming from the
TSC, 7250 coming from the HPET).

> So yes, limiting the maximum readout interval and skipping the check is
> sensible.

It is a bug to mark a clocksource unstable if the skew is 668 microseconds
in 14.5 seconds. One possible solution is to skip the check. I originally
posted a patch scaling the uncertainty_margin of clocksources but it got
no support and the feedback I got was to avoid the calculation and skip
the current check in order to keep the code simple:
https://lore.kernel.org/lkml/20231221160517.GA22919@incl/#t
Since skipping the check solves issue as well I sent a patch.

> > /*
> > * Interval: 0.5sec.
> > */
> > -#define WATCHDOG_INTERVAL (HZ >> 1)
> > +#define WATCHDOG_INTERVAL (HZ >> 1)
> > +#define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
> > + * (NSEC_PER_SEC / HZ))
>
> That 1.5 * WATCHDOG_INTERVAL seems to be rather arbitrary. One second
> should be safe enough, no?

Yes, it is arbitrary. The concern is how strict can we allow the skew
check to get. 2 * WATCHDOG_INTERVAL would mean imposing a skew threshold
of 250 microseconds per second for intervals that are close in their value
to 2 * WATCHDOG_INTERVAL. Even using 2 * WATCHDOG_INTERVAL would still be
many times better than using 500 microseconds to check skew in a 14.5-long
watchdog interval.

> > + /*
> > + * The processing of timer softirqs can get delayed (usually
> > + * on account of ksoftirqd not getting to run in a timely
> > + * manner), which causes the watchdog interval to stretch.
> > + * Some clocksources, e.g. acpi_pm, cannot tolerate
> > + * watchdog intervals longer than a few seconds.
>
> What ensures that the watchdog did not wrap around then?

Nothing. It has always been this way. The check usually fails when the
watchdog wraps around, in which case the clocksource is marked unstable
for no fault of its own.

> > + watchdog_max_intr = interval;
> > + pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
> > + cs_nsec, wd_nsec);
>
> This really wants to have a proper indication why the check was skipped,
> i,e. due to a long readout interval, no?

It could be changed to:
pr_warn("Large watchdog interval, skipping check: cs_nsec: %lld wd_nsec: %lld\n",

I will send a v3 incorporation all the suggestions after we have made the
description intelligible. Thank you for the feedback.
--
Jiri Wiesner
SUSE Labs