Re: [PATCH] watchdog: Prefer use "ref-cycles" for NMI watchdog

From: Song Liu
Date: Mon May 15 2023 - 16:33:49 EST




> On May 15, 2023, at 3:33 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, May 12, 2023 at 09:43:48AM -0700, Song Liu wrote:
>> On Fri, May 12, 2023 at 5:47 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>>
>>> On Tue, May 09, 2023 at 03:17:00PM -0700, Song Liu wrote:
>>>> NMI watchdog permanently consumes one hardware counters per CPU on the
>>>> system. For systems that use many hardware counters, this causes more
>>>> aggressive time multiplexing of perf events.
>>>>
>>>> OTOH, some CPUs (mostly Intel) support "ref-cycles" event, which is rarely
>>>> used. Try use "ref-cycles" for the watchdog. If the CPU supports it, so
>>>> that one more hardware counter is available to the user. If the CPU doesn't
>>>> support "ref-cycles", fall back to "cycles".
>>>>
>>>> The downside of this change is that users of "ref-cycles" need to disable
>>>> nmi_watchdog.
>>>
>>> Urgh..
>>>
>>> how about something like so instead; then you can use whatever event you
>>> like...
>>
>> Configuring this at boot time is not ideal for our use case. Currently, we have
>> some systems support ref-cycles and some don't. So this is one more kernel
>> argument we need to make sure to get correctly. This also means we cannot
>> change this setting without reboot.
>
> You can still add the fallback (with a suitable pr_warn() that the
> requested config is not valid or so).
>
>> Another idea I have is to use sysctl kernel.nmi_watchdog, so we can change
>> the event after boot. Would this work?
>
> Yeah, I suppose you can also extend the thing to allow runtime changes
> to the values, provided the NMI watchdog is disabled at the time or
> somesuch.

How about something like:

sysctl kernel.nmi_watchdog=0 => no nmi watchdog
sysctl kernel.nmi_watchdog=1 => use "cycles" nmi watchdog
sysctl kernel.nmi_watchdog=2 => use "ref-cycles" nmi watchdog

I know the values are arbitrary. But using raw code encoding seems an
overkill (for command line or sysctl).


By the way, current code for "sysctl kernel.nmi_watchdog" doesn't
report the error properly. For example, on an Intel host, with:

sysctl kernel.nmi_watchdog=0
perf stat -e cycles:D,cycles:D,cycles:D,cycles:D,cycles:D -a &
sysctl kernel.nmi_watchdog=1
kill $(pidof perf)

At this point, the watchdog is effectively disabled, because the perf
events failed to enable. But sysctl still show:

kernel.nmi_watchdog = 1

How about we use current version (plus error reporting fix suggested
by Andrew)? If this causes any user visible issues, we can add the
option to configure it via sysctl (and fix sysctl reporting at the
same time).

Thanks,
Song