Re: [RFC PATCH] watchdog: Adding softwatchdog

From: Peter.Enderborg
Date: Sat Apr 24 2021 - 09:05:10 EST


On 4/24/21 2:21 PM, Christophe Leroy wrote:
>
>
> On 24/04/2021 at 12:25, Peter Enderborg wrote:
>> This is not a rebooting watchdog. Its function is to take actions
>> other than a hard reboot. On many complex systems there is some
>> kind of manager that monitors the system and takes action when it
>> gets slow. Android has its lowmemorykiller (lmkd), desktops have
>> earlyoom. This watchdog can help such a monitor by performing some
>> basic action that keeps the monitor running.
>>
>> It can also be used standalone. This adds a policy that kills
>> the process with the highest oom_score_adj, using oom functions
>> to do it quickly. I think it is a good use case
>> for the patch. Low-memory situations can be problematic for
>> software that monitors the system, but other policies
>> should also be possible, like picking tasks from a memcg, or
>> specific UIDs, or whatever is low priority.
>
>
> I'm not sure I understand the reasoning behind the choice of oom logic to decide which task to kill.
>
This is not using oom logic to pick a task to kill; it is using oom functions to free resources fast.

The OOM killer is also too slow, so there are userspace solutions that start removing processes before the
system starts to slow down.

On Ubuntu and Fedora, for example, a process called earlyoom is running. On Android there is lmkd. However,
allocations can be huge and fast, for example when starting a camera. What can then happen is that the service
that is there to remove applications that are not needed gets starved. It does a lot of operations that need
memory, and because of this it also gets slow. In the worst case this can cause an OOM. The OOM killer kills
things randomly, and it will cause an Android phone to reboot if it kills the wrong things. When the service
gets slow it can't kick the watchdog, and then we can free up resources from within the kernel. To get the
current version to work, very high margins are needed, which wastes a lot of memory just to be "safe".
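
As a rough illustration of that mechanism, here is a small userspace sketch. It is not the code from the
patch: struct soft_wd, soft_wd_kick() and soft_wd_expired() are hypothetical names made up for this example,
and time is passed in by the caller as plain milliseconds. The monitor kicks the watchdog while it is healthy;
once it is starved and misses the deadline, the expiry check tells us it is time to free resources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for a non-rebooting ("soft") watchdog. */
struct soft_wd {
	uint64_t timeout_ms;   /* how long the monitor may stay silent */
	uint64_t deadline_ms;  /* absolute time at which the policy should run */
};

/* Called each time the userspace monitor (e.g. lmkd, earlyoom) feeds the watchdog. */
void soft_wd_kick(struct soft_wd *wd, uint64_t now_ms)
{
	wd->deadline_ms = now_ms + wd->timeout_ms;
}

/* True once the monitor has missed its deadline, i.e. the policy should act. */
bool soft_wd_expired(const struct soft_wd *wd, uint64_t now_ms)
{
	return now_ms >= wd->deadline_ms;
}
```

With a 1000 ms timeout and a kick at t=5000, the check stays false up to t=5999 and becomes true at t=6000,
unless another kick arrives first.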


> Usually a watchdog will detect if a task is using 100% of the CPU time. If such a task exists, it is the one running, not another one that has a huge amount of memory allocated but spends like 1% of CPU time.
>
A watchdog detects that you do not feed it.
> So if there is a task to kill by a watchdog, I would say it is the current task.


Current task? We usually have many CPUs. But the idea is that you should be able to easily write a policy for that, if that is what you want.
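
To make "write a policy" a bit more concrete, here is a simplified userspace sketch of what a pluggable
policy could look like. Everything in it is an assumption for illustration only: struct task_info,
wd_policy_fn, policy_highest_adj() and policy_running() are hypothetical and are not the interface in the
patch. The first policy mirrors the one described in the patch (highest oom_score_adj), the second one
picks a task that is running on a CPU.

```c
#include <stddef.h>

/* Hypothetical, simplified view of a task; this is not a kernel structure. */
struct task_info {
	int pid;
	int oom_score_adj;  /* -1000..1000, higher means more expendable */
	int on_cpu;         /* nonzero if the task is running on some CPU */
};

/* A policy picks one victim from the candidates, or returns NULL. */
typedef const struct task_info *(*wd_policy_fn)(const struct task_info *tasks,
						size_t n);

/* Pick the task with the highest oom_score_adj (first one wins a tie). */
const struct task_info *policy_highest_adj(const struct task_info *tasks,
					   size_t n)
{
	const struct task_info *victim = NULL;

	for (size_t i = 0; i < n; i++)
		if (!victim || tasks[i].oom_score_adj > victim->oom_score_adj)
			victim = &tasks[i];
	return victim;
}

/* Pick the first task found running; with many CPUs there can be several. */
const struct task_info *policy_running(const struct task_info *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (tasks[i].on_cpu)
			return &tasks[i];
	return NULL;
}
```

Switching behaviour is then just a matter of pointing a wd_policy_fn at a different function.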


>
>
>
> Another remark: you are using regular timers as far as I understand. I remember having problems with that in the past; it required the use of hrtimers. I can't remember the details exactly, but you can look at
> commit https://github.com/linuxppc/linux/commit/1ff688209


That I definitely need to look into.


> Christophe