Re: [PATCH v2] clocksource: Warn if too many missing ticks are detected

From: Thomas Gleixner
Date: Wed Sep 19 2018 - 03:53:54 EST


On Tue, 18 Sep 2018, Waiman Long wrote:

> The clocksource watchdog, when running, is scheduled on all the CPUs in
> the system sequentially in a round-robin fashion with a period of 0.5s.
> A bug in the 4.18 kernel is causing missing ticks when nohz_full
> is specified. Under some circumstances, this causes the watchdog to
> incorrectly state that the TSC is unstable because of counter overflow
> in the hpet watchdog clock source after a few minutes delay.
>
> That particular bug is fixed by the 4.19 commit 7059b36636beab ("sched:
> idle: Avoid retaining the tick when it has been stopped"). To make it
> easier to catch this kind of bug in the future, a check is added to see
> if there is too much delay in the invocation of the watchdog callback
> and print a warning once if it happens.

On second thought, the clocksource watchdog is the wrong place for this
check, because it only looks where one symptom happens to show up. What
about putting it right at the source, i.e. in the timer wheel? A check
there does not depend on the clocksource watchdog being active. The
clocksource watchdog triggering is just one of the symptoms; timers being
massively late is not a good thing in general.
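
To illustrate the kind of check meant here, below is a minimal userspace
sketch, not actual timer wheel code. It compares the tick count at which a
timer callback runs against the tick at which it was due, and warns once if
the difference exceeds a threshold. All names (late_timer_check,
timer_is_too_late, check_timer_lateness) and the threshold value are
hypothetical and chosen only for illustration; the signed-difference
comparison is the usual wrap-safe idiom and assumes a 32-bit tick counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical threshold: ticks a timer may be late before we warn. */
#define LATE_TICKS_THRESHOLD 100u

/* Emulates warn-once behaviour with a per-checker flag. */
struct late_timer_check {
	bool warned;
};

/*
 * Return true if 'now' is more than 'threshold' ticks past 'expires'.
 * The signed difference keeps the comparison correct across counter
 * wraparound, as long as the two values are less than 2^31 ticks apart.
 */
static bool timer_is_too_late(uint32_t now, uint32_t expires,
			      uint32_t threshold)
{
	int32_t delta = (int32_t)(now - expires);

	return delta > 0 && (uint32_t)delta > threshold;
}

/*
 * Called at the point where an expired timer is about to be run.
 * Prints a warning the first time a massively late timer is seen and
 * returns true in that case; returns false otherwise.
 */
static bool check_timer_lateness(struct late_timer_check *chk,
				 uint32_t now, uint32_t expires)
{
	assert(chk != NULL);

	if (chk->warned ||
	    !timer_is_too_late(now, expires, LATE_TICKS_THRESHOLD))
		return false;

	chk->warned = true;
	fprintf(stderr, "timer expired %u ticks late\n",
		(unsigned int)(now - expires));
	return true;
}
```

Because the check sits where expired timers are processed, it would fire for
any late timer, whether or not the clocksource watchdog is running. What
threshold is sensible, and whether some timers should be exempt, is left
open in this sketch.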

Thanks,

tglx