Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

From: Paul E. McKenney
Date: Sat Jan 06 2024 - 07:04:25 EST


On Sat, Jan 06, 2024 at 10:55:09AM +0800, Feng Tang wrote:
> On Thu, Jan 04, 2024 at 11:19:56AM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 04, 2024 at 05:30:50PM +0100, Jiri Wiesner wrote:
> > > On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote:
> > > > I believe that there were concerns about a similar approach in the case
> > > > where the jiffies counter is the clocksource
> > >
> > > I ran a few simple tests on a 2 NUMA node Intel machine and found nothing
> > > so far. I tried booting with clocksource=jiffies and I changed the
> > > "nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable
> > > the watchdog on my machine. I have a debugging module that monitors
> > > clocksource and watchdog reads in clocksource_watchdog() with kprobes. I
> > > see the cs/wd reads executed roughly every 0.5 second, as expected. When
> > > the machine is idle the average watchdog interval is 501.61 milliseconds
> > > (+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The
> > > result is similar when the CPUs of the machine are fully saturated with
> > > netperf processes. I also tried booting with clocksource=jiffies and
> > > tsc=watchdog. The watchdog interval was similar to the previous test.
> > >
> > > AFAIK, the jiffies clocksource does not get checked by the watchdog itself.
> > > And with that, I have run out of ideas.
> >
> > If I recall correctly (ha!), the concern was that with the jiffies as
> > clocksource, we would be using jiffies (via timers) to check jiffies
> > (the clocksource), and that this could cause issues if the jiffies got
> > behind, then suddenly updated while the clocksource watchdog was running.
>
> Yes, we also hit a problem when 'jiffies' was used as clocksource/watchdog,
> but I don't know if it's the same problem you mentioned. Our problem
> ('jiffies' as watchdog marks clocksource TSC as unstable) only happens
> in the early boot phase with serial earlyprintk enabled. Updating
> 'jiffies' relies on the HW timer's periodic interrupt, but early printk
> disables interrupts while printing, so some timer interrupts are
> lost and 'jiffies' lags badly. Rui once proposed a patch to
> prevent 'jiffies' from being a watchdog due to its unreliability [1].
>
> And I think skipping the watchdog check one time when detecting some
> abnormal condition won't hurt the overall check much.

Works for me!
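
For anyone following along, here is a rough userspace sketch of the "skip
one check when the interval looks abnormal" idea discussed above. It is not
the patch itself: the wd_check() helper, the enum names, and the cutoff of
twice the nominal 0.5 s interval are assumptions made for illustration only.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL
/* Nominal watchdog period: reads happen roughly every 0.5 s in the tests above. */
#define WD_INTERVAL_NS (NSEC_PER_SEC / 2)
/* Assumed cutoff (hypothetical): beyond twice the nominal period is "abnormal". */
#define WD_INTERVAL_MAX_NS (2 * WD_INTERVAL_NS)

enum wd_verdict { WD_OK, WD_SKIPPED, WD_UNSTABLE };

/*
 * wd_delta_ns: elapsed time as measured by the watchdog
 * cs_delta_ns: elapsed time as measured by the clocksource under test
 * margin_ns:   allowed disagreement between the two measurements
 */
static enum wd_verdict wd_check(int64_t wd_delta_ns, int64_t cs_delta_ns,
				int64_t margin_ns)
{
	int64_t skew;

	/* Interval stretched too far: do not judge the clocksource this round. */
	if (wd_delta_ns > WD_INTERVAL_MAX_NS || cs_delta_ns > WD_INTERVAL_MAX_NS)
		return WD_SKIPPED;

	/* Normal interval: compare the two elapsed times against the margin. */
	skew = cs_delta_ns - wd_delta_ns;
	if (skew < 0)
		skew = -skew;

	return skew > margin_ns ? WD_UNSTABLE : WD_OK;
}
```

A skipped round leaves the clocksource unjudged until the next regular
interval, which is why a single skip should cost the overall check very
little.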

Thanx, Paul

> [1]. https://lore.kernel.org/lkml/bd5b97f89ab2887543fc262348d1c7cafcaae536.camel@xxxxxxxxx/
>
> Thanks,
> Feng
>
> > Thoughts?
> >
> > Thanx, Paul