[RFC] hangcheck-timer module

From: Joel Becker (Joel.Becker@oracle.com)
Date: Thu Nov 21 2002 - 15:17:11 EST


Folks,
        Attached is a module, hangcheck-timer. It is used to detect
when the system goes out to lunch for a period of time, such as when a
driver like qla2x00 udelays a bunch.
        The module sets a timer. When the timer goes off, it then uses
the TSC (warning: portability needed) to determine how much real time
has passed.
        On a normal system, the real elapsed time will be almost
identical to the expected timer duration. However, if a driver decides
to udelay for 60 seconds (or the system stalls for some other reason),
the timer fires late and the module takes notice. If the elapsed time
overshoots the expected duration by more than the configured margin,
the machine is rebooted.
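
        For illustration only, here is a minimal user-space sketch of
that check. It is not the module code: it assumes a POSIX monotonic
clock in place of the TSC, and the struct, the function names and the
exact comparison (elapsed > tick + margin) are hypothetical stand-ins
for whatever the module really does. It only reports an overrun; it
does not reboot anything.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical mirror of the two module options, both in seconds. */
struct hangcheck_cfg {
	uint64_t tick_s;   /* expected timer duration (hangcheck_tick) */
	uint64_t margin_s; /* tolerated overshoot (hangcheck_margin) */
};

/* Assumption: a monotonic clock stands in for the TSC read. */
uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * NSEC_PER_SEC + (uint64_t)ts.tv_nsec;
}

/*
 * Return true when the time between arming the timer and the timer
 * firing exceeds the expected duration plus the margin. Callers must
 * pass fired_ns >= armed_ns.
 */
bool hangcheck_overrun(const struct hangcheck_cfg *cfg,
		       uint64_t armed_ns, uint64_t fired_ns)
{
	uint64_t elapsed = fired_ns - armed_ns;
	uint64_t limit = (cfg->tick_s + cfg->margin_s) * NSEC_PER_SEC;

	return elapsed > limit;
}

/* What a timer callback might do: sample the clock and compare. */
bool hangcheck_fire(const struct hangcheck_cfg *cfg, uint64_t armed_ns)
{
	return hangcheck_overrun(cfg, armed_ns, now_ns());
}
```

        With made-up values of tick_s = 30 and margin_s = 180, a timer
that fires 30 seconds after being armed is fine, while one that fires
more than 210 seconds later counts as a hang.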
        The module is currently used in a cluster environment. After
some time out to lunch, the rest of the cluster will have given up on a
machine. If the machine suddenly comes back and assumes it is still
"live", bad things can happen.
        We can also see use for this in a debugging sense, for kernel
hangs as well as driver code. That's why I'm proposing it for general
inclusion.
        Comments? Thoughts?

Joel

Building:
        The module should happily build against most 2.4 kernels. The
usual module building compile line:
        gcc -I /scratch/jlbec/kernel/linux-2.4.20-rc2/include \
                -DMODULE -D__KERNEL__ -DLINUX -c -o hangcheck-timer.o \
                hangcheck-timer.c

Running:
        Load the module with insmod. There are two options.
"hangcheck_tick=<seconds>" specifies the timer timeout, and
"hangcheck_margin=<seconds>" specifies the margin of error.

Joel

-- 

"Friends may come and go, but enemies accumulate." - Thomas Jones

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


