Re: NMI errors in 2.0.30??

Stephen Costaras (stevecs@chaven.com)
Sat, 26 Apr 1997 19:49:14 -0500 (CDT)


> Try de-tuning your cache/RAM in your machine's BIOS and see if the problems go
> away. It wouldn't surprise me at all if the Tyan corporation ekes some of that
> extra speed out of the RAM and the machine by overtuning RAM speeds on the
> knowledge that DOS, Winbloze, etc. run too slowly to cause problems. On the
> other hand, even a modest increase in the speed of Linux has been known to
> cause these problems in the past (and they often occur in the ext2 code, both
> during checks and during operation). On normal machines, this would result in
> corruption; with parity and ECC RAM, it gets caught with these messages (ECC
> corrects it, I don't believe the parity RAM does anything but note the problem
> and we still get the occasional corruption). If you can de-tune your RAM or
> cache and the problem goes away, then it's a fairly solid indicator that
> 2.0.30 is slightly faster than 2.0.29 and it's causing marginal memory setups
> to break.

OK, I had some time to do some more testing. I've used three different sets of
memory (4 x 32 MB FPM parity SIMMs). All of this memory has been in use for over 5 months
under various other kernels with no problems, much of that time under very heavy load.

All my systems consist of the following:
Tyan S1668, w/ 2 PPro (200 MHz, 256 KB cache)
128 MB RAM (FPM, Parity)
Tyan BIOS v3.03, NO powersaving turned on in BIOS, ECC enabled
Buslogic BT-958 controller
Digital DE500AA ethernet card
Monolithic kernel

Using a striped (RAID_0) disk made up of 2 fast/wide Seagate Barracuda 4 GB drives, and
testing between kernels v2.0.29 and 2.0.30, I did the following. Filled up the
disk with articles from my news server (about 1,500,000 files). Rebooted under
v2.0.29 w/ ECC enabled and ran several fscks across the disk, no problems. Ran several
badblocks -w passes across the disk, also no problems. Recopied all files over to the new
volume and ran another fsck. (Did this routine 5 times, no errors.)

Rebooted the system, turned on Parity mode (non-ECC) in the BIOS and, still under v2.0.29,
ran the above tests: no problems.

Booted (ECC) w/ 2.0.30 and went through the same procedure. Received an NMI on boot while
fscking the volume (I have all my disks set to auto-fsck when mount-count is 1). Dropped
down to maintenance mode, remounted root as r/w, and fscked the RAID_0 volume: received
several ext2 errors and the fsck process died w/ SIG11.
Rebooted the system again in the same configuration, after removing all disks from fstab
except root. Ran the same routine as above; the system died w/ 2 SIG11 errors (NMI) out of
5 iterations.

Rebooted system (2.0.30) w/ Parity mode enabled in BIOS. No problems with five iterations.
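
A rough way to judge how much five clean iterations can prove: suppose the 2-in-5
failure rate seen under 2.0.30 with ECC were the true per-iteration rate (an assumption,
since five runs is a very small sample), and that iterations are independent (also an
assumption). The sketch below, with an illustrative function name, gives the chance of
seeing a run of clean iterations by luck alone.

```python
def chance_of_all_clean(per_run_failure_rate, runs):
    """Probability that `runs` independent iterations all pass,
    given an assumed per-iteration failure rate."""
    return (1.0 - per_run_failure_rate) ** runs


if __name__ == "__main__":
    # Assumed rate: 2 failures in 5 iterations, as seen with 2.0.30 + ECC.
    assumed_rate = 2 / 5
    for runs in (5, 10, 20):
        print(runs, "clean runs by chance:", chance_of_all_clean(assumed_rate, runs))
```

Under those assumptions five clean runs would happen by chance roughly 8% of the time, so
ten or twenty iterations per configuration would make the comparison a good deal firmer.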

Now I'm a complete layman when it comes to kernel/hardware interactions, but this 'looks'
like the kernel can't understand the ECC mode in the Tyan BIOS. Can anyone help shed
some more light here? Any suggestions for a more quantitative test?
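
One possible starting point for a more quantitative test is a user-space pattern test
that takes the disks and ext2 out of the picture. This is only a minimal sketch: the
function name, patterns and buffer size are illustrative choices, it touches only
whatever pages the kernel happens to hand out, and with ECC enabled the hardware may
correct single-bit errors before software ever sees them. A count of zero therefore
does not prove the memory is healthy; the kernel log still needs watching for NMI
messages while it runs.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55), passes=1):
    """Fill a buffer with each byte pattern, read it back,
    and return the total number of mismatched bytes."""
    buf = bytearray(size_bytes)
    mismatches = 0
    for _ in range(passes):
        for pattern in patterns:
            expected = bytes([pattern]) * size_bytes
            buf[:] = expected              # write the pattern into the buffer
            if buf != expected:            # read back and compare
                mismatches += sum(1 for a, b in zip(buf, expected) if a != b)
    return mismatches


if __name__ == "__main__":
    # 8 MB buffer, 10 passes; both values are arbitrary for illustration.
    print("mismatched bytes:", pattern_test(8 * 1024 * 1024, passes=10))
```

Running the same number of passes under each kernel and each BIOS memory mode, and
recording the mismatch count alongside any NMI messages, would give figures that can be
compared directly.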

Stephen Costaras
stevecs@chaven.com