Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when high read latency detected

From: Feng Tang
Date: Tue Dec 20 2022 - 20:05:27 EST


Using the correct email address for John Stultz.

On Tue, Dec 20, 2022 at 10:34:00AM -0800, Paul E. McKenney wrote:
> On Tue, Dec 20, 2022 at 11:11:08AM -0500, Waiman Long wrote:
> > On 12/20/22 03:25, Feng Tang wrote:
> > > A bug was reported on 8-socket x86 machines where TSC was wrongly
> > > disabled when the system was under heavy workload.
> > >
> > > [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
> > > [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
> > > [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
> > > [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
> > > [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
> > > [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
> > > [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > > [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
> > > [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
> > > [ 821.067990] clocksource: Switched to clocksource hpet
> > >
> > > This can be reproduced when the system is running the memory-intensive
> > > 'stream' test, or some stress-ng subcases like 'ioport'.
> > >
> > > The reason is that when the system is under heavy load, the read latency
> > > of a clocksource can be very high. This can be seen even with the
> > > lightweight TSC read, and is much worse on external clocksources based
> > > on MMIO or IO port reads, which makes the watchdog check inaccurate.
> > >
> > > As the clocksource watchdog is a lifetime check with a frequency of
> > > twice a second, there is no need to rush it when the system is under
> > > heavy load and the clocksource read latency is very high. In that
> > > case, suspend the watchdog timer for 5 minutes.
> > >
> > > Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
> > > ---
> > > kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
> > > 1 file changed, 32 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > index 9cf32ccda715..8cd74b89d577 100644
> > > --- a/kernel/time/clocksource.c
> > > +++ b/kernel/time/clocksource.c
> > > @@ -384,6 +384,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
> > > }
> > > EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
> > > +static inline void clocksource_reset_watchdog(void)
> > > +{
> > > + struct clocksource *cs;
> > > +
> > > + list_for_each_entry(cs, &watchdog_list, wd_list)
> > > + cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > > +}
> > > +
> > > +
> > > static void clocksource_watchdog(struct timer_list *unused)
> > > {
> > > u64 csnow, wdnow, cslast, wdlast, delta;
> > > @@ -391,6 +400,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > int64_t wd_nsec, cs_nsec;
> > > struct clocksource *cs;
> > > enum wd_read_status read_ret;
> > > + unsigned long extra_wait = 0;
> > > u32 md;
> > > spin_lock(&watchdog_lock);
> > > @@ -410,13 +420,30 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
> > > - if (read_ret != WD_READ_SUCCESS) {
> > > - if (read_ret == WD_READ_UNSTABLE)
> > > - /* Clock readout unreliable, so give it up. */
> > > - __clocksource_unstable(cs);
> > > + if (read_ret == WD_READ_UNSTABLE) {
> > > + /* Clock readout unreliable, so give it up. */
> > > + __clocksource_unstable(cs);
> > > continue;
> > > }
> > > + /*
> > > + * When WD_READ_SKIP is returned, it means the system is likely
> > > + * under very heavy load, where the latency of reading
> > > + * watchdog/clocksource is very big, and affect the accuracy of
> > > + * watchdog check. So give system some space and suspend the
> > > + * watchdog check for 5 minutes.
> > > + */
> > > + if (read_ret == WD_READ_SKIP) {
> > > + /*
> > > + * As the watchdog timer will be suspended, and
> > > + * cs->last could keep unchanged for 5 minutes, reset
> > > + * the counters.
> > > + */
> > > + clocksource_reset_watchdog();
> > > + extra_wait = HZ * 300;
> > > + break;
> > > + }
> > > +
> > > /* Clocksource initialized ? */
> > > if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> > > atomic_read(&watchdog_reset_pending)) {
> > > @@ -512,7 +539,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > * pair clocksource_stop_watchdog() clocksource_start_watchdog().
> > > */
> > > if (!timer_pending(&watchdog_timer)) {
> > > - watchdog_timer.expires += WATCHDOG_INTERVAL;
> > > + watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
> > > add_timer_on(&watchdog_timer, next_cpu);
> > > }
> > > out:
> > > @@ -537,14 +564,6 @@ static inline void clocksource_stop_watchdog(void)
> > > watchdog_running = 0;
> > > }
> > > -static inline void clocksource_reset_watchdog(void)
> > > -{
> > > - struct clocksource *cs;
> > > -
> > > - list_for_each_entry(cs, &watchdog_list, wd_list)
> > > - cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > > -}
> > > -
> > > static void clocksource_resume_watchdog(void)
> > > {
> > > atomic_inc(&watchdog_reset_pending);
> >
> > It looks reasonable to me. Thanks for the patch.
> >
> > Acked-by: Waiman Long <longman@xxxxxxxxxx>
>
> Queued, thank you both!

Thanks for reviewing and queueing!

> If you would like this to go in some other way:
>
> Acked-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
>
> And while I am remembering it... Any objections to reversing the role of
> TSC and the other timers on systems where TSC is believed to be accurate?
> So that if there is clocksource skew, HPET is marked unstable rather than
> TSC?

For the bug in the commit log, I think it's the 8-socket system with
hundreds of CPUs that causes the big latency, while the HPET itself may
not be broken. If we switched to ACPI PM_TIMER as the watchdog, we
could see similarly big latency.
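
To make the three outcomes in the quoted log easier to follow, here is a
minimal user-space sketch of how read-back delays could be classified. It
is not the kernel's cs_watchdog_read(); the helper name, the struct, and
the threshold parameter are all hypothetical, and the "watchdog alone is
slow" cutoff of half the threshold is an assumption made for illustration.
Only the enum names come from the patch above.

```c
#include <stdint.h>

/* Outcomes named after the enum values visible in the quoted patch. */
enum wd_read_status { WD_READ_SUCCESS, WD_READ_UNSTABLE, WD_READ_SKIP };

/* One read attempt, with both delays already converted to nanoseconds. */
struct wd_attempt {
	int64_t wd_cs_wd_ns; /* watchdog -> clocksource -> watchdog */
	int64_t wd_wd_ns;    /* two back-to-back watchdog reads */
};

/*
 * Hypothetical helper: classify a series of attempts.
 * max_skew_ns is an assumed threshold, passed in rather than hard-coded.
 */
static enum wd_read_status
classify_attempts(const struct wd_attempt *a, int n, int64_t max_skew_ns)
{
	for (int i = 0; i < n; i++) {
		/* Reads were fast enough: the skew check can be trusted. */
		if (a[i].wd_cs_wd_ns <= max_skew_ns)
			return WD_READ_SUCCESS;
		/* Even the watchdog alone is slow: blame the load, not the clocksource. */
		if (a[i].wd_wd_ns > max_skew_ns / 2)
			return WD_READ_SKIP;
	}
	/* Every attempt was slow while the watchdog itself looked fine. */
	return WD_READ_UNSTABLE;
}
```

With an assumed threshold of 100000ns, the first pair of log lines
(181880ns wd-tsc-wd, 1203520ns wd-wd) would fall into the skip case, which
is the case the patch turns into a 5 minute back-off.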

I used to see this issue only with stress tools like stress-ng, but it
seems that with larger and larger systems, even a memory-intensive load
can easily trigger it.
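
For reference, the back-off in the quoted patch comes down to simple
jiffies arithmetic when the timer is re-armed. The sketch below is a
stand-alone illustration, not kernel code: the helper name is hypothetical,
hz is passed in as a parameter, and the regular interval is assumed to be
half a second, matching the "twice a second" in the commit log.

```c
/* Hypothetical helper mirroring the re-arm arithmetic in the quoted patch. */
static unsigned long
next_watchdog_expiry(unsigned long expires, int skipped, unsigned long hz)
{
	/* Regular check interval: twice a second (assumed to be hz / 2). */
	unsigned long interval = hz >> 1;
	/* Extra 5 minutes (hz * 300) only when the read was skipped. */
	unsigned long extra_wait = skipped ? hz * 300 : 0;

	return expires + interval + extra_wait;
}
```

With hz = 1000, a normal pass moves the expiry forward by 500 jiffies and a
skipped pass by 300500 jiffies.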

> This would preserve the diagnostics without hammering performance
> when skew is detected. (Switching from TSC to HPET hammers performance
> enough that our automation usually notices and reboots the system.)

Yes, switching to HPET is a disaster for performance; we've seen
drops of 30% to 90% in different benchmarks.

Thanks,
Feng

> Thanx, Paul