Re: what are some more advanced error collection methods?

From: Al Niessner
Date: Wed May 06 2009 - 18:36:25 EST



Using a voltmeter, I verified that the 5V and 12V rails are good while
the computer is running under a normal load. So, I am going to go with
the power supply being alright for now.

I changed the CPU temperature by a couple of degrees with no failure.
While I cannot rule out a thermal problem, I am willing to lean toward a
software problem; meaning, the kernel is hard locking.

Now I just need some way to get some helpful information out of it so
that I can move toward a solution.
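
The console pulse described in the quoted message below can be sketched
roughly as follows. This is only an illustration, not the script actually
used: the function names, the 10 second interval, and the message format
are all assumptions. The idea is that the last line seen on the serial
console bounds the time of the lock up to within one interval.

```python
import os
import time


def heartbeat_line(seq, now=None):
    """Build one heartbeat line: a sequence number plus a UTC timestamp."""
    if now is None:
        now = time.time()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    return "heartbeat %d %s UTC\n" % (seq, stamp)


def write_heartbeats(path="/dev/console", interval=10.0, count=None):
    """Write one heartbeat line to path every `interval` seconds.

    count=None runs until interrupted; an integer stops after that many
    lines. Returns the number of lines written.
    """
    seq = 0
    while count is None or seq < count:
        # Open and close on every pulse so no data sits in a user-space
        # buffer when the machine locks up.
        fd = os.open(path, os.O_WRONLY | os.O_APPEND)
        try:
            os.write(fd, heartbeat_line(seq).encode("ascii"))
        finally:
            os.close(fd)
        seq += 1
        if count is None or seq < count:
            time.sleep(interval)
    return seq


if __name__ == "__main__":
    # Writing to /dev/console normally requires root.
    write_heartbeats()
```

A gap in the sequence numbers would suggest the writer stalled before the
final lock up; an unbroken sequence that simply stops matches what I see.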

On Wed, 2009-05-06 at 14:17 -0700, Al Niessner wrote:
> I am running 2.6.27 on an AMD 64 x2 dual core 6000+. I have the OS
> installed on its own disk (SATA) and have an mdraid (SATA) with 3 disks
> being mirrored for my critical data. I also have an mdraid with 2 disks
> being mirrored (USB but I wanted firewire) for very low rate data. Both
> mdraids are nfs mounted and use automount on top of that -- nothing
> peculiar about nfs and automount except that nfs is over two networks
> each with their own NIC. My problem is that every 36 hours the machine
> simply locks up. Here is what I find:
>
> 1) num lock light is on but was off prior to lock up
> 2) no response to beating the num and caps lock keys
> 3) no response to beating the sysreq key plus any sequences
> 4) nothing is recorded in kern.log, syslog, or any other log file
> in /var/log
> 5) cannot get to console because keyboard is dead
> 6) have to hold power switch for 10 seconds to get computer to turn off
> so the computer is not suspended (power management is not installed
> anyway)
> 7) when computer is rebooted, the mdraids are usually clean (no resync)
> 8) did a memtest and it passes
>
> Since nothing showed up in the logs and I could not read the console, I
> found an old computer and connected the one I care about to it via
> ttyS0. Now I have the console even though the keyboard is dead. However,
> when the lock up occurs, there is absolutely no output to my RS232
> console. I put a pulse onto the console via /dev/console and get stuff
> right up until the change of state, but no panic shows up. On reboot, I
> start getting characters from the kernel immediately. Hence, I have to
> conclude that the serial connection is viable, but there is simply no
> output from the kernel.
>
> So, I have tried all of the simple stuff that I know about or found via
> google. Now I would like some more advanced ways of trying to pry
> helpful information from a dying kernel. Are there more advanced tools,
> tricks, or secrets for collecting fault information?
>
> Any and all help is appreciated in advance.
>
> One last item, I am still working on determining if this is a hardware
> or software problem. The voltages look reasonable and the room is
> thermally stable to +/- 1C. So, I am having a hard time blaming
> hardware.
>
--
Al Niessner
818.354.0859

--------
| dS | >= 0
--------
