Seeking Linux watchdog design advice to troubleshoot mysterious silent reboot issue

From: Vincent Li
Date: Mon Dec 05 2011 - 14:55:47 EST


Hi,

We have a complex system with a large number of processes running
simultaneously. If any process gets into a faulty state and hangs, or
consumes more than its fair share of system resources, the other
processes may not get a chance to run and the whole system can hang,
interrupting system functionality and user traffic.

To prevent the system from hanging, we use a host watchdog mechanism
to make sure the system can detect a hang and get out of it with a
host reboot. This feature is implemented with the hardware watchdog
counter. The counter is initialized to a fixed number and counts down
automatically unless it is reset by a special user space process,
call it watchdog. Whenever watchdog gets a chance to run, it touches
the hardware watchdog counter. If the system gets too busy and the
watchdog process does not get a chance to run before the hardware
watchdog counter reaches 0, the host is rebooted in an attempt to
recover from the hang.
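
For illustration only, the petting loop described above can be
sketched against the standard Linux watchdog character device. This is
a minimal sketch, not our actual daemon: the device path, the timeout
value, the half-timeout petting interval and the function names
pet_interval_secs and run_watchdog are all assumptions made for the
example.

```c
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical policy: pet at half the timeout (at least 1 second),
 * so that a single late wakeup does not trigger a reset. */
int pet_interval_secs(int timeout_secs)
{
    int interval = timeout_secs / 2;
    return interval > 0 ? interval : 1;
}

/*
 * Open the watchdog device, program the timeout and pet it forever.
 * Returns -1 if setup or a keepalive fails; does not return otherwise.
 * Opening the device arms the hardware counter.
 */
int run_watchdog(const char *dev, int timeout_secs)
{
    int fd = open(dev, O_WRONLY);
    if (fd < 0)
        return -1;

    /* The driver may round the timeout; it writes back the value used. */
    if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout_secs) < 0) {
        /* "Magic close": ask the driver to disarm, if it supports it. */
        if (write(fd, "V", 1) < 0) {
            /* nothing more we can do here */
        }
        close(fd);
        return -1;
    }

    for (;;) {
        /* Reset the hardware counter. If this process is starved past
         * the timeout, the hardware reboots the host. */
        if (ioctl(fd, WDIOC_KEEPALIVE, 0) < 0)
            return -1; /* fd left open on purpose: the counter keeps running */
        sleep(pet_interval_secs(timeout_secs));
    }
}
```

A caller would typically invoke run_watchdog("/dev/watchdog", 60) from
main(); both arguments are example values.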

Recently, there have been a number of cases in which units silently
rebooted with little information in the system log. In most of these
cases, the unit was rebooted by the host hardware watchdog, and
because of the nature of a hardware watchdog reboot, little
information about the current state of the system is preserved or
logged before the reset takes effect. After the reboot, it is hard to
analyze the real cause of the hang.

We have been thinking about moving the user space watchdog process
into the kernel and invoking a kernel function like dump_stack to show
the hanging process's stack trace before the hardware reset. We have
tried drivers/watchdog/softdog.c as a proof of concept, but we are
unable to get the hanging process's stack trace. We also tried to use
kdump, but we are unable to run it in our kernel for some other
technical reason.

CPU and memory control group features are not being considered at this
stage because they are too invasive a change for our custom kernel.

Could you share your experience with this kind of issue? We would
really like to find out which faulty process caused the CPU to
deschedule the user space watchdog process, and to dump the stack
trace of that faulty process.
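
To make the kind of data we are after concrete, here is a minimal
sketch of attributing CPU time to a process from user space by
sampling /proc/<pid>/stat twice and comparing utime + stime. It is an
illustration, not a solution: it assumes the sampler itself still gets
scheduled (for example by running at a real-time priority), and the
names proc_sample, parse_stat_line and cpu_ticks_delta are
hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* One sample of the fields of interest from /proc/<pid>/stat. */
struct proc_sample {
    int pid;
    char comm[64];
    char state;
    unsigned long utime; /* clock ticks spent in user mode */
    unsigned long stime; /* clock ticks spent in kernel mode */
};

/* Parse one /proc/<pid>/stat line. Returns 0 on success, -1 on error. */
int parse_stat_line(const char *line, struct proc_sample *out)
{
    /* comm may itself contain spaces or parentheses, so it is taken
     * as everything between the first '(' and the last ')'. */
    const char *lp = strchr(line, '(');
    const char *rp = strrchr(line, ')');
    size_t len;

    if (!lp || !rp || rp < lp)
        return -1;
    if (sscanf(line, "%d", &out->pid) != 1)
        return -1;

    len = (size_t)(rp - lp - 1);
    if (len >= sizeof(out->comm))
        len = sizeof(out->comm) - 1;
    memcpy(out->comm, lp + 1, len);
    out->comm[len] = '\0';

    /* After ')': state, then ppid pgrp session tty_nr tpgid (skipped),
     * flags minflt cminflt majflt cmajflt (skipped), utime, stime. */
    if (sscanf(rp + 1,
               " %c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
               &out->state, &out->utime, &out->stime) != 3)
        return -1;
    return 0;
}

/* CPU ticks consumed by a process between two samples of it. */
unsigned long cpu_ticks_delta(const struct proc_sample *before,
                              const struct proc_sample *after)
{
    return (after->utime + after->stime) -
           (before->utime + before->stime);
}
```

A sampler built on this would walk /proc periodically and log the
process with the largest delta, so that the last log entry before a
watchdog reset points at a likely culprit.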

Thank you in advance!

Vincent