machine check errors

From: don fisher
Date: Thu Jan 12 2006 - 18:36:05 EST


I have a Tyan S2892 board with a pair Opteron 288 dual core cpus and 16GB dram. I receive the errors shown in the attached file, mcelog. It appears that these occur when the free memory becomes small, there is a lot in the cache, and a lot of IO.

The Tyan S2892 has an Nvidia Crush K8-04, which I think they call the southbridge. My errors appear to be related to the north bridge. There is an AMD 8131 PCI-X controller that runs the PCI slots. There is a 3WARE 9500-12 located in one of the PCI-X slots.

I have run Memtest86+-1.65 for 24 hours without errors. I recently upgraded the BIOS to V2.00 without any remarkable changes.

I am running 2.6.15 within a current Fedora Core4 configuration.

I would appreciate any advice as to how to proceed. I have not noticed any adverse behavior from the mce's. But that could be masked is data transfered or ???.

Could there be any connection with the memory cache? Thanks in advance for your assistance.

don
--

MCE 0
CPU 2 4 northbridge TSC 967b1992c66
ADDR 2a52cb5f0
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 40b9
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d45cc00140080813 MCGSTATUS 0
MCE 1
CPU 2 4 northbridge TSC a101bbf7338
ADDR 2922df698
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 2
CPU 2 4 northbridge TSC ab885e5bdbe
ADDR 2922df698
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 0
CPU 2 4 northbridge TSC b60f17bf394
ADDR 2918d98d0
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 1
CPU 2 4 northbridge TSC c095ba23a7e
ADDR 2918cdff0
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 40b9
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d45cc00040080a13 MCGSTATUS 0
MCE 2
CPU 2 4 northbridge TSC cb1c7387269
ADDR 2bf0cfa50
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 3
CPU 2 4 northbridge TSC d5a315f1a34
ADDR 2900df990
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 20e8
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d474400020080a13 MCGSTATUS 0
MCE 4
CPU 2 4 northbridge TSC e029b8504a6
ADDR 2900dd030
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 5
CPU 2 4 northbridge TSC eab071b5316
ADDR 291ac9d98
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 6
CPU 2 4 northbridge TSC f537141ab1c
ADDR 2918dfe78
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 40b9
bit33 = err cpu1
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d45cc00240080813 MCGSTATUS 0
MCE 7
CPU 2 4 northbridge TSC ffbdb67fd26
ADDR 2beac9010
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 40b9
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d45cc00040080a13 MCGSTATUS 0
MCE 0
CPU 2 4 northbridge TSC 10a446fe04fa
ADDR 2cfcc9870
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 40b9
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d45cc00140080813 MCGSTATUS 0
MCE 1
CPU 2 4 northbridge TSC 114cb12451a2
ADDR 291ac9630
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit33 = err cpu1
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00260080813 MCGSTATUS 0
MCE 2
CPU 2 4 northbridge TSC 11f51cba9b82
ADDR 2c10cb010
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 20e8
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d474400120080813 MCGSTATUS 0
MCE 3
CPU 2 4 northbridge TSC 129d86e0d26a
ADDR 294ec9390
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00160080813 MCGSTATUS 0