Re: [PATCH] kernel/watchdog: fix spurious hard lockups

From: Don Zickus
Date: Wed Jun 21 2017 - 09:41:06 EST


On Tue, Jun 20, 2017 at 02:33:09PM -0700, kan.liang@xxxxxxxxx wrote:
> From: Kan Liang <Kan.liang@xxxxxxxxx>
>
> Some users reported spurious NMI watchdog timeouts.
>
> We now have more and more systems where the Turbo range is wide enough
> that the NMI watchdog expires faster than the soft watchdog timer that
> updates the interrupt tick the NMI watchdog relies on.
>
> This problem was originally introduced by commit 58687acba592
> ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> Previously the NMI watchdog would always check jiffies, which were
> ticking fast enough. But now the backing timer is quite slow, so the
> expiry time becomes more sensitive.
>
> For mainline the right fix is to switch the NMI watchdog to reference
> cycles, which always tick at the same rate independent of turbo mode.
> But this requires some complicated changes in perf, which are too
> difficult to backport. Since we need a stable fix too, just increase the
> NMI watchdog period here to avoid the spurious timeouts. This is not an
> ideal fix because a 3x as large Turbo range could still fail, but for
> now that's not likely.
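
For anyone following along, here is a rough userspace model of the
arithmetic as I understand it. It is only a sketch: it assumes the perf
event counts unhalted core cycles with a period of
cpu_khz * 1000 * watchdog_thresh, and that the hrtimer which bumps the
interrupt count fires every 2/5 of watchdog_thresh seconds. The function
names are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/* Assumed model: the hrtimer that updates the interrupt count fires
 * every 2/5 of watchdog_thresh seconds (4s for the default of 10). */
double hrtimer_period_sec(int watchdog_thresh)
{
	return watchdog_thresh * 2.0 / 5.0;
}

/* Wall-clock time between NMIs when the period is programmed as
 * multiplier * base_khz * 1000 * watchdog_thresh core cycles, but the
 * core actually runs at turbo_khz. */
double nmi_period_sec(uint64_t base_khz, uint64_t turbo_khz,
		      int watchdog_thresh, int multiplier)
{
	uint64_t cycles = (uint64_t)multiplier * base_khz * 1000 *
			  (uint64_t)watchdog_thresh;

	return (double)cycles / ((double)turbo_khz * 1000.0);
}

/* In this model a spurious hard lockup report becomes possible once
 * two NMIs can land inside a single hrtimer period. */
int spurious_possible(uint64_t base_khz, uint64_t turbo_khz,
		      int watchdog_thresh, int multiplier)
{
	return nmi_period_sec(base_khz, turbo_khz, watchdog_thresh,
			      multiplier) < hrtimer_period_sec(watchdog_thresh);
}
```

With watchdog_thresh = 10, a 1 GHz base clock running at 3 GHz gives an
NMI roughly every 3.3s against a 4s timer, so the unpatched period can
misfire. With the 3x multiplier the NMI interval becomes 10s, and in
this model it only fails once the turbo ratio exceeds 7.5.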

As this is an Intel problem, we should at least restrict it to
arch/x86/kernel/apic/hw_nmi.c. I don't want to penalize other arches yet.
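
Roughly what I have in mind, as a standalone sketch rather than a tested
patch: keep the generic caller untouched and fold the factor into the
x86 helper. The function below is a hypothetical userspace model (it
takes cpu_khz as a parameter instead of reading the kernel global), and
the factor of 3 is simply the one from the patch.

```c
#include <stdint.h>

/* Hypothetical model of an x86-only variant of
 * hw_nmi_get_sample_period(): the 3x factor lives in the arch helper,
 * so the generic watchdog code and other arches are not affected. */
uint64_t x86_sample_period_model(unsigned int cpu_khz, int watchdog_thresh)
{
	return 3 * (uint64_t)cpu_khz * 1000 * (uint64_t)watchdog_thresh;
}
```

For a 2 GHz part (cpu_khz = 2000000) and watchdog_thresh = 10 this gives
a period of 60000000000 cycles, versus 20000000000 without the factor.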

>
> Signed-off-by: Kan Liang <Kan.liang@xxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and
> softlockup detector")
> ---
>
> The right fix for mainline can be found here.
> perf/x86/intel: enable CPU ref_cycles for GP counter
> perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
> https://patchwork.kernel.org/patch/9779087/
> https://patchwork.kernel.org/patch/9779089/

Does that mean this fix is restricted to just -stable then? Otherwise I am
confused why we should take this patch, if you have a better fix above.

Cheers,
Don

>
> kernel/watchdog_hld.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
> index 54a427d1f344..0f7c6e758b82 100644
> --- a/kernel/watchdog_hld.c
> +++ b/kernel/watchdog_hld.c
> @@ -164,7 +164,7 @@ int watchdog_nmi_enable(unsigned int cpu)
>  		firstcpu = 1;
>
>  	wd_attr = &wd_hw_attr;
> -	wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
> +	wd_attr->sample_period = 3 * hw_nmi_get_sample_period(watchdog_thresh);
>
>  	/* Try to register using hardware perf events */
>  	event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL);
> --
> 2.11.0
>