Opteron fatal machine check during PCI probe

From: Tim Hockin
Date: Wed Jun 16 2004 - 19:09:00 EST


Hey all,

I have a couple of dual-Opteron boxen that consistently get an MCE during
PCI probing. This is from linux-2.6.6, but the EXACT same scenario happens
on a 2.4.x kernel.

The MCE shows that the error is an IO read, with the address 0xfdfc000cfe.
The RIP points to pci_conf1_read(), when we try to inw() from the PCI data
register.

This is called during the PCI probing, and stops the kernel dead in its
tracks. The disassembly of the surrounding code is:

ffffffff802822c5:  89 ca            mov    %ecx,%edx
ffffffff802822c7:  83 e2 02         and    $0x2,%edx
ffffffff802822ca:  66 81 c2 fc 0c   add    $0xcfc,%dx
ffffffff802822cf:  66 ed            in     (%dx),%ax

This all seems legit to me.
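
For anyone reading along, here is a rough C sketch of what that instruction
sequence appears to compute, assuming %ecx holds the config register offset.
The helper names are made up for illustration and are not the kernel's own;
the 0xCF8/0xCFC ports and the address layout are the standard PCI
configuration mechanism #1.

```c
#include <stdint.h>

/* Standard PCI configuration mechanism #1 I/O ports. */
#define PCI_CONF1_ADDR_PORT 0xCF8u
#define PCI_CONF1_DATA_PORT 0xCFCu

/*
 * Hypothetical helper: the dword written to 0xCF8 to select a
 * bus/device/function/register before touching the data port.
 */
static uint32_t conf1_address(unsigned bus, unsigned dev,
                              unsigned fn, unsigned reg)
{
    return 0x80000000u            /* enable bit */
         | ((bus & 0xffu) << 16)
         | ((dev & 0x1fu) << 11)
         | ((fn  & 0x07u) << 8)
         | (reg & 0xfcu);         /* dword-aligned register offset */
}

/*
 * Hypothetical helper: the port used for a 16-bit read. This mirrors the
 * "and $0x2,%edx; add $0xcfc,%dx" pair in the disassembly: bit 1 of the
 * register offset picks the low or high word of the data register.
 */
static uint16_t conf1_word_port(unsigned reg)
{
    return (uint16_t)(PCI_CONF1_DATA_PORT + (reg & 2u));
}
```

With a register offset that has bit 1 set (0x02, for example),
conf1_word_port() gives 0xcfe, which matches the low 16 bits of the address
in the MCE.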

What is interesting is that the address 0xfdfc000cfe is correct in the
low-order 16 bits. The extra 0xfdfc000000 is what is puzzling to me, or
maybe it's a red herring.
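
To make the split concrete, here is a minimal C sketch that separates the
reported address into the I/O port part and the unexplained upper part. The
function names are hypothetical, and this only does the arithmetic; it says
nothing about where the upper bits come from.

```c
#include <stdint.h>

/* Hypothetical helper: low 16 bits of the reported address (the I/O port). */
static uint16_t mce_addr_port(uint64_t addr)
{
    return (uint16_t)(addr & 0xffffu);
}

/* Hypothetical helper: everything above the low 16 bits. */
static uint64_t mce_addr_upper(uint64_t addr)
{
    return addr & ~(uint64_t)0xffffu;
}
```

For 0xfdfc000cfe this yields port 0x0cfe (that is, 0xcfc + 2) and an upper
part of 0xfdfc000000.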

I added a show_registers() to the MCE handler, and %rdx *really* is all
zeros, other than the 0xcfe.

If I disable MCE, then the system boots fine and runs fine.

Anyone have any ideas?

Tim