RE: [PATCH] kernel/watchdog: fix spurious hard lockups

From: Liang, Kan
Date: Wed Jun 28 2017 - 09:24:26 EST



> > From: Kan Liang <Kan.liang@xxxxxxxxx>
> >
> > Some users reported spurious NMI watchdog timeouts.
> >
> > We now have more and more systems where the Turbo range is wide enough
> > that the NMI watchdog expires faster than the soft watchdog timer that
> > updates the interrupt tick the NMI watchdog relies on.
>
> AFAIR the watchdog doesn't rely on deferred timers so this would suggest
> that a standard hrtimer can expire much later than programmed, right?

The softlockup watchdog relies on hrtimers.
The hardlockup watchdog (NMI watchdog) relies on the perf subsystem and
counts unhalted CPU cycles.
When the softlockup watchdog's hrtimer expires, it increments
hrtimer_interrupts.
When the NMI watchdog expires, it checks whether hrtimer_interrupts has
changed since the previous NMI, and uses that to determine whether it's
a hardlockup.
The design was to make the softlockup watchdog run at 2.5 times the
rate of the NMI watchdog, so that hrtimer_interrupts is guaranteed to be
updated before the NMI watchdog expires.
That works well if Turbo-Mode is disabled.
However, when Turbo-Mode is enabled, unhalted CPU cycles can accumulate
much faster than expected, so the NMI watchdog may expire even before
the softlockup watchdog's hrtimer does.
In that case the softlockup watchdog does not get a chance to update
hrtimer_interrupts before the NMI check, which triggers false positives.
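
To put rough numbers on this, here is a small back-of-the-envelope
sketch of the arithmetic (not kernel code). It assumes the NMI period is
programmed as watchdog_thresh seconds' worth of cycles at the nominal
frequency, and that the hrtimer fires 2.5 times per watchdog_thresh. The
function names and the example frequencies are made up for illustration.

```python
# Illustrative model only; names and frequencies are hypothetical.

def nmi_period_seconds(watchdog_thresh_s, nominal_hz, actual_hz):
    """Wall-clock time until the cycle-based NMI watchdog expires.

    Assumption: the counter is programmed with watchdog_thresh_s seconds
    of cycles at the nominal frequency, but counts at actual_hz.
    """
    programmed_cycles = watchdog_thresh_s * nominal_hz
    return programmed_cycles / actual_hz


def hrtimer_period_seconds(watchdog_thresh_s, rate_ratio=2.5):
    """Wall-clock interval of the softlockup hrtimer (2.5x the NMI rate)."""
    return watchdog_thresh_s / rate_ratio


def may_false_positive(watchdog_thresh_s, nominal_hz, actual_hz):
    """True if an NMI can arrive before the hrtimer has fired once."""
    return (nmi_period_seconds(watchdog_thresh_s, nominal_hz, actual_hz)
            < hrtimer_period_seconds(watchdog_thresh_s))


if __name__ == "__main__":
    thresh = 10                      # seconds, assumed threshold
    nominal = 1_000_000_000          # 1 GHz nominal, made-up example
    for actual in (1_000_000_000, 2_000_000_000, 3_000_000_000):
        print(actual // 1_000_000, "MHz:",
              round(nmi_period_seconds(thresh, nominal, actual), 2), "s NMI,",
              hrtimer_period_seconds(thresh), "s hrtimer,",
              "false positive possible:",
              may_false_positive(thresh, nominal, actual))
```

Under these assumptions the margin is gone as soon as the actual
frequency exceeds 2.5 times the nominal one, since the NMI period then
drops below the hrtimer interval.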


Thanks,
Kan

> If that is the case how come other parts of the system do not break. We do
> rely on hrtimers on many other places?
> --
> Michal Hocko
> SUSE Labs