Re: linux SMP stability or lack thereof

Ricardo Galli Granada (gallir@atlas-iap.es)
Wed, 30 Sep 1998 00:26:02 +0200 (MET)


> > Is this a known problem to the kernel hackers or is it a new problem?
> There is definitely at least one 2.0.x SMP lockup problem I've seen
> several times and we have EIP traces off. It however isnt a blank screen
> dead machine lockup, its a "Deadlock detected on ....." message in each
> of the reports it tallies to or alternatively its a "live machine nothing
> happening" livelock. Again predictable.
> Finally try with 2.0.29 kernel images. The only "major" SMP change in
> recent history is about 2.0.30 when Leonard added some IRQ forwarding
> facilities.

As I posted in previous messages, there are definitely some reproducible
lockup problems with 2.0 and SMP on some motherboards.

I was having them with an RC440LX motherboard (2 x PII 300 MHz): the
machine locked up, blank screen, no noise...

According to Doug, the AIC7xxx is SMP safe but I tried the driver with
the following different combinations (SCSI and NIC) (with standard,
clean, no modules kernel, no forwarding, just plain TCP/IP and libc5,
gcc 2.7.2.1):

7880UW+Intel EtherExpress
7880UW+3Com905
2940UW+Intel EtherExpress
2940UW+3Com905

on a 10 (ten) Mbit/s LAN.

I could lock up the server *always*, just by doing a ping flood from
another Linux machine (a clean Pentium 133) while compiling the kernel.

I tried different SCSI options (BIOS, no BIOS, reset, no reset) and
the results were the same. Then I tried reducing the amount of memory
available to the kernel (via mem=xxxMB in lilo.conf). I reduced the memory
to 246 MB (the machine has 256) and the machine died anyway.
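
For reference, a minimal sketch of what such a lilo.conf entry might look
like. The image path and label are placeholders, not the ones from this
machine; only the 246 MB limit comes from the description above, and the
"M" suffix is the form I would expect the kernel to accept:

```
# Hypothetical lilo.conf stanza; image path and label are assumptions.
image=/vmlinuz
    label=linux
    # Limit the memory the kernel uses to 246 MB (the machine has 256).
    append="mem=246M"
```

After editing lilo.conf, /sbin/lilo has to be re-run so the boot map picks
up the new option.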

Finally I got tired (and scared: it is a production machine, and before
putting it into production I tested everything for ten days on a
private network and it never died) and disabled SMP in the Makefile, so
now it's crippled but alive.

gallir@star:/home/people/gallir > w
12:07am up 42 days, 6:32, 2 users, load average: 0.19, 0.12, 0.04

I must say that it was more stable with 2.0.34 than with 2.0.35. With
2.0.33 and squid the machine died every 2 days.

So I would bet that it is not:

- a temperature problem, nor
- a memory problem, nor
- the network card driver.

Doug believes that the 5.0.19 aic7xxx driver is SMP safe and that perhaps
it is something deeper in the kernel (I tried 5.1.0pre10 but it does not
work: reset problems, which seem to be an automatic termination issue).

I am going to try with a Buslogic 958, but they take a "few" months to be
delivered in Europe (I requested one from the Spanish Buslogic distributor
a month ago and am still waiting).

SMP FAQ maintainers, you may want to put this motherboard on
the bad list (another one...)

--
Ricardo Galli
