Re: [PATCH clocksource] Reject bogus watchdog clocksource measurements

From: Feng Tang
Date: Tue Nov 01 2022 - 01:43:52 EST


On Mon, Oct 31, 2022 at 10:42:12AM -0700, Paul E. McKenney wrote:

[...]
> > > @@ -448,8 +448,26 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > continue;
> > > }
> > > if (wd_nsec > (wdi << 2)) {
> >
> > Just recalled one thing: it may be better to check 'cs_nsec'
> > instead of 'wd_nsec', as some watchdogs have a small wrap-around
> > value. IIRC, HPET's counter is 32 bits long and wraps at about
> > 300 seconds, and PMTIMER's counter is 24 bits, which wraps at about
> > 3 ~ 4 seconds. So when a long stall of the watchdog timer happens,
> > the watchdog's value could 'overflow' many times.
> >
> > And usually the 'current' clocksource has a longer wrap time than
> > the watchdog.
>
> Why not both?

You mean checking both the clocksource and the watchdog? That's fine
with me, though I still trust the clocksource more.

I checked some old emails and found some long stall logs for reference.

* one stall of 471 seconds

[ 2410.694068] clocksource: timekeeping watchdog on CPU262: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 2410.706920] clocksource: 'hpet' wd_nsec: 0 wd_now: ffd70be2 wd_last: 40da633b mask: ffffffff
[ 2410.718583] clocksource: 'tsc' cs_nsec: 471766594285 cs_now: 44f62c184e9 cs_last: 394a7a43771 mask: ffffffffffffffff
[ 2410.732568] clocksource: 'tsc' is current clocksource.
[ 2410.740553] tsc: Marking TSC unstable due to clocksource watchdog
[ 2410.747611] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 2410.757321] sched_clock: Marking unstable (2398804490960, 11943006672)<-(2419023952548, -8276474713)
[ 2410.767741] clocksource: Checking clocksource tsc synchronization from CPU 233 to CPUs 0,73,93-94,226,454,602,821.
[ 2410.784045] clocksource: Switched to clocksource hpet


* another one of 5 seconds

[ 3302.211708] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 3302.211710] clocksource: 'acpi_pm' wd_nsec: 312227950 wd_now: 92367f wd_last: 8128bd mask: ffffff
[ 3302.211712] clocksource: 'tsc' cs_nsec: 4999196389 cs_now: 9e811223a9754 cs_last: 9e80e767df194 mask: ffffffffffffffff
[ 3302.211714] clocksource: 'tsc' is current clocksource.
[ 3302.211716] tsc: Marking TSC unstable due to clocksource watchdog


>
> if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
>
> > > - /* This can happen on busy systems, which can delay the watchdog. */
> > > - pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > + bool needwarn = false;
> > > + u64 wd_lb;
> > > +
> > > + cs->wd_bogus_count++;
> > > + if (!cs->wd_bogus_shift) {
> > > + needwarn = true;
> > > + } else {
> > > + delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > > + wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > > + if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > > + needwarn = true;
> >
> > I'm not sure we need to check the last_bogus counter; maybe the
> > current interval 'cs_nsec' is all we care about, with some code
> > like this?
>
> I thought we wanted exponential backoff? Do you really get that from
> the changes below?

Aha, I misunderstood you. I thought the idea was to report only once for
each stall of 2, 4, 8, ... 256 seconds, and after that to report only
stalls of 512+ seconds. Your approach looks good to me, as our intention
is to avoid a flood of warning messages.
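
For reference, the exponential backoff as I now understand it can be
sketched in userspace roughly as below. This is only an illustration,
not the patch itself: the struct and function names are hypothetical,
and it takes a generic monotonic timestamp in nanoseconds rather than
the watchdog counter (or jiffies) that the real code would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-clocksource state, loosely mirroring the wd_bogus_* fields. */
struct bogus_backoff {
	int shift;                /* current backoff exponent, capped at 10 */
	uint64_t last_warn_ns;    /* timestamp of the last emitted warning */
	unsigned long count;      /* bogus measurements seen so far */
	unsigned long count_last; /* value of count at the last warning */
};

/*
 * Record one bogus measurement at time now_ns and return true if a
 * warning should be printed. The first one always warns; later ones
 * warn only once (1 << shift) * wdi_ns has passed since the last warning.
 */
static bool bogus_should_warn(struct bogus_backoff *b, uint64_t now_ns,
			      uint64_t wdi_ns)
{
	assert(b);
	b->count++;
	if (b->shift &&
	    now_ns - b->last_warn_ns < ((uint64_t)1 << b->shift) * wdi_ns)
		return false;

	b->last_warn_ns = now_ns;
	if (b->shift < 10)
		b->shift++;
	b->count_last = b->count;
	return true;
}
```

With a 0.5 s interval and a bogus measurement on every pass, this warns
at 0 s, 1 s, 3 s, 7 s, 15 s and so on, so the gap between warnings
doubles until the shift reaches its cap.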

Thanks,
Feng

> And should we be using something like the jiffies counter to measure the
> exponential backoff?
>
> Thanx, Paul
>
> > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > index daac05aedf56..3910dbb9b960 100644
> > --- a/include/linux/clocksource.h
> > +++ b/include/linux/clocksource.h
> > @@ -125,7 +125,6 @@ struct clocksource {
> > struct list_head wd_list;
> > u64 cs_last;
> > u64 wd_last;
> > - u64 wd_last_bogus;
> > int wd_bogus_shift;
> > unsigned long wd_bogus_count;
> > unsigned long wd_bogus_count_last;
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index 6537ffa02e44..8e6d498b1492 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -442,28 +442,18 @@ static void clocksource_watchdog(struct timer_list *unused)
> >
> > /* Check for bogus measurements. */
> > wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > - if (wd_nsec < (wdi >> 2)) {
> > + if (cs_nsec < (wdi >> 2)) {
> > /* This usually indicates broken timer code or hardware. */
> > - pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > + pr_warn("timekeeping watchdog on CPU%d: clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), cs->name, cs_nsec, WATCHDOG_INTERVAL);
> > continue;
> > }
> > - if (wd_nsec > (wdi << 2)) {
> > - bool needwarn = false;
> > - u64 wd_lb;
> > -
> > + if (cs_nsec > (wdi << 2)) {
> > cs->wd_bogus_count++;
> > - if (!cs->wd_bogus_shift) {
> > - needwarn = true;
> > - } else {
> > - delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > - wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > - if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > - needwarn = true;
> > - }
> > - if (needwarn) {
> > + if (!cs->wd_bogus_shift ||
> > + (1 << cs->wd_bogus_shift) * wdi <= cs_nsec) {
> > /* This can happen on busy systems, which can delay the watchdog. */
> > - pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> > - cs->wd_last_bogus = wdnow;
> > + pr_warn("timekeeping watchdog on CPU%d: clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), cs->name, cs_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> > +
> > if (cs->wd_bogus_shift < 10)
> > cs->wd_bogus_shift++;
> > cs->wd_bogus_count_last = cs->wd_bogus_count;
> >
> > Thanks,
> > Feng
> >
> >
> > > + }
> > > + if (needwarn) {
> > > + /* This can happen on busy systems, which can delay the watchdog. */
> > > + pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> > > + cs->wd_last_bogus = wdnow;
> > > + if (cs->wd_bogus_shift < 10)
> > > + cs->wd_bogus_shift++;
> > > + cs->wd_bogus_count_last = cs->wd_bogus_count;
> > > + }
> > > continue;
> > > }
> > >