Re: [PATCH v2] watchdog/mm: Allow dumping memory info in pretimeout

From: Vincent Whitchurch
Date: Wed Jun 14 2023 - 03:43:00 EST


On Mon, 2023-06-12 at 07:53 -0700, Guenter Roeck wrote:
> On 6/12/23 00:26, Vincent Whitchurch wrote:
> > On my (embedded) systems, the most common cause of hitting the watchdog
> > (pre)timeout is thrashing. Diagnosing these problems is hard
> > without knowing the memory state at the point of the watchdog hit. In
> > order to make this information available, add a module parameter to the
> > watchdog pretimeout panic governor to ask it to dump memory info and the
> > OOM task list (using a new helper in the OOM code) before triggering the
> > panic.
>
> Personally I don't think this is the right way of approaching this problem.
> First, the userspace task controlling the watchdog should run as a realtime
> task, forced to be in memory, and not be affected by thrashing.

That may not be appropriate in all cases since you may want the watchdog
to hit when the system as a whole really is unusable.

> Second, the problem should be observable well before the watchdog fires.

Yes, there are ways to try to detect it earlier (e.g. PSI) and attempt
recovery, even if the kernel's OOM killer itself is very slow to react.

But if those attempts fail for whatever reason and we actually do end up
hitting the watchdog, something like this patch provides information
which is invaluable for diagnosing the problem.
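
As a rough illustration of the earlier-detection route mentioned above, a
userspace monitor could poll the PSI memory file and react before the
watchdog fires. The sketch below assumes a kernel built with CONFIG_PSI and
the documented /proc/pressure/memory line format; the helper names and the
10% threshold are hypothetical and chosen only for the example, not taken
from the patch.

```python
from pathlib import Path

# Assumes CONFIG_PSI=y; the file is absent otherwise.
PSI_MEMORY = Path("/proc/pressure/memory")


def parse_psi(text):
    """Parse PSI text into {"some": {"avg10": ..., ...}, "full": {...}}.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


def memory_stalled(psi, full_avg10_limit=10.0):
    """Return True if all non-idle tasks were stalled on memory for at
    least full_avg10_limit percent of the last 10 seconds.

    The limit is a hypothetical value and would need tuning per system.
    """
    return psi.get("full", {}).get("avg10", 0.0) >= full_avg10_limit


if __name__ == "__main__":
    if PSI_MEMORY.exists():
        psi = parse_psi(PSI_MEMORY.read_text())
        print(psi, memory_stalled(psi))
    else:
        print("PSI not available on this kernel")
```

A monitor like this only helps while userspace can still be scheduled, which
is why the kernel-side dump remains useful as a last resort.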

> Last but not least, I don't think it is appropriate to intertwine
> watchdog code with oom handling code as suggested here.

The show_mem() function is in lib/ so that's outside of the OOM
handling. The oom_dump_tasks() function could perhaps be refactored and
moved to a neutral location, which would avoid the intertwining.