up kernel stable, but smp kernel randomly reboots - nfsroot - asus cur_dls

From: Ryan Sweet (rsweet@atos-group.nl)
Date: Thu Jul 19 2001 - 07:05:51 EST


I posted previously about random reboots on nfsroot nodes across kernels
2.2.18 - 2.4.6 (all kernels exhibit the same problem: after some amount
of time, usually < 24 hours, the system just reboots).

When I run the systems with uniprocessor kernels, the problem does not
occur.

When the smp kernel is booted with noapic, the apic errors go away. Other
posts I read about smp apic problems seemed to indicate that they received
hundreds of messages in a short period of time - I was getting maybe seven
or eight over the course of several hours.
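
For anyone wanting to put a number on how often these messages show up, a
small script along these lines could tally them from a saved kernel log.
This is only a minimal sketch: it assumes the messages look like
"APIC error on CPU0: 02(02)", and the function name count_apic_errors is
made up for illustration.

```python
import re
from collections import Counter

# Assumed message shape, e.g. "APIC error on CPU0: 02(02)".
# Adjust the pattern if the kernel in use prints something different.
APIC_RE = re.compile(r"APIC error on CPU(\d+): ([0-9a-fA-F]+)\(([0-9a-fA-F]+)\)")


def count_apic_errors(lines):
    """Return a Counter mapping CPU number -> count of APIC error lines.

    `lines` is any iterable of strings, for example an open log file.
    (Hypothetical helper, not part of any existing tool.)
    """
    counts = Counter()
    for line in lines:
        match = APIC_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Passing an open file (for instance the saved output of dmesg) to
count_apic_errors gives a per-CPU total, which makes it easier to compare a
handful of errors over several hours against the hundreds that others have
reported.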

I cannot locate any references on the net to others having trouble with
SMP in asus cur_dls boards or with the ServerWorks chipset.

Is it possible that there is some interaction between smp and nfsroot and
cur_dls that is causing the problem (all of my other cur_dls boards are
using a local disk)? I've tried wrapping my head around the nfs code
to search for smp specific problems, and while I understand a lot more of
it now than I did before, it is still mostly beyond my immediate
comprehension.

Is it possible that this is a power/cpu voltage problem? If so, would a
ups be a solution?

Is it possible that the whole batch of 10 motherboards is broken somehow
(we have oodles of other asus cur_dls smp systems that don't have
problems, just this cluster)?

Are there any suggestions as to further troubleshooting options?

I am working on booting with a tftp downloaded ramdisk as the root, to
eliminate nfsroot from the equation, but I am skeptical as to whether this
will actually help anything.

regards,
-ryan

-- 
Ryan Sweet <ryan.sweet@atosorigin.com>
Atos Origin Engineering Services
http://www.aoes.nl



