Re: 2.6.33: pci 0000:00:00.0: address space collision / spontaneous reboots

From: Justin Piszcz
Date: Fri Mar 12 2010 - 17:07:11 EST




On Fri, 12 Mar 2010, Bjorn Helgaas wrote:

On Friday 12 March 2010 01:32:17 pm Justin Piszcz wrote:

Even with all boards removed:
[    0.133537] pci 0000:00:00.0: address space collision: [mem 0xe0000000-0xffffffff 64bit] already in use

00:00.0 Host bridge: ATI Technologies Inc RD790 Northbridge only dual slot PCI-e_GFX and HT3 K8 part

How about current Linus' tree with pci=nocrs or pci=use_crs?

Hi, I saw your second e-mail, so it sounds like a bad board or something
that Linux does not have a quirk for yet, but in any case, per your
recommendations:

pci=nocrs:
http://home.comcast.net/~jpiszcz/20100312/dmesg-pci-nocrs.txt

pci=use_crs:
http://home.comcast.net/~jpiszcz/20100312/dmesg-use-crs.txt

No collision when pci=use_crs is used, BUT the system still crashes.

Instead of the collision message, it says this:

[ 0.133598] PCI: pci_cache_line_size set to 64 bytes
[ 0.133603] pci 0000:00:00.0: BAR 3: reserving [mem 0xe0000000-0xffffffff flags 0x120204] (d=0, p=0)
[ 0.133606] pci 0000:00:00.0: no compatible bridge window for [mem 0xe0000000-0xffffffff 64bit]
[ 0.133610] pci 0000:00:00.0: can't reserve [mem 0xe0000000-0xffffffff 64bit]
[ 0.133617] pci 0000:00:11.0: BAR 0: reserving [io 0xff00-0xff07 flags 0x20101] (d=0, p=0)

[ 0.133735] Expanded resource reserved due to conflict with PCI Bus 0000:00

Let's look at some of these messages:

pci_root PNP0A03:00: host bridge window [mem 0x40000000-0xfed0ffff]

That looks normal to me. If you could boot a current upstream kernel,
e.g., 2.6.34-rc1, I think it might print more information about your
AMD PCI address space routing. BTW, it looks like you have four CPUs,
but your kernel is only compiled to support two.

The latest e-mail shows similar messages (2.6.34-rc1).


pci 0000:00:00.0: reg 1c: [mem 0xe0000000-0xffffffff 64bit]
pci 0000:00:00.0: no compatible bridge window for [mem 0xe0000000-0xffffffff 64bit]
pci 0000:00:00.0: can't reserve [mem 0xe0000000-0xffffffff 64bit]

These are just telling us that the device BAR 0xe0000000-0xffffffff
doesn't fit inside the bridge window of 0x40000000-0xfed0ffff. I don't
know why the device has that weird-looking BAR, but that by itself
shouldn't be fatal because we don't have any drivers that try to use
that BAR.

OK. By the way, keep in mind that all boards have been removed from the
system. The serial port, 1394, floppy, and some other things have also been
disabled on the motherboard to free up IRQs, in case that was the cause; it
made no difference. I also tried many pci= options, noapic, and acpi=off;
nothing helps.
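
To make the "doesn't fit inside the bridge window" point concrete, here is a
minimal sketch of the containment check, using the two address ranges from the
log. It is an illustration in Python, not kernel code; the helper name
fits_in_window and the tuple layout are assumptions made for this example.

```python
# Illustrative only: fits_in_window is a made-up helper, not a kernel function.
# Ranges are inclusive (start, end) pairs, as printed in dmesg.

def fits_in_window(bar, window):
    """Return True if the inclusive range `bar` lies entirely inside `window`."""
    bar_start, bar_end = bar
    win_start, win_end = window
    return win_start <= bar_start and bar_end <= win_end

bar = (0xE0000000, 0xFFFFFFFF)     # 0000:00:00.0 BAR from the log
window = (0x40000000, 0xFED0FFFF)  # host bridge window from the log

# The BAR starts inside the window but ends above it, so the check fails.
print(fits_in_window(bar, window))  # False
```

The BAR's start address is within the window, but its end (0xffffffff) is past
the window's end (0xfed0ffff), which matches the "no compatible bridge window"
message.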


Expanded resource reserved due to conflict with PCI Bus 0000:00

This comes from e820_reserve_resources_late(). I wish it were a
more useful message and showed the actual conflict and what was
expanded, but I don't think it's a problem in itself.

Ok..


pnp 00:0a: disabling [mem 0x000f0000-0x000f3fff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]

We failed to reserve the 0xe0000000-0xffffffff region above, so we just
cleared out the resource. It keeps the same size, so it ends up at
0x00000000-0x1fffffff, where it appears to conflict with a lot of PNP
devices. But this isn't a real conflict; it's just Linux being stupid
because we don't handle that PCI resource correctly.

Ok..
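
The arithmetic behind that apparent conflict can be checked by hand. The
sketch below is an illustration in Python, not the kernel's resource code; the
helper names are assumptions made for this example. It keeps the size of the
failed BAR, moves it to address 0, and tests it against the PNP range from the
log.

```python
# Illustrative only: these helpers are made up for this example and do not
# mirror the kernel's resource handling. Ranges are inclusive (start, end).

def size(res):
    """Number of bytes covered by an inclusive range."""
    return res[1] - res[0] + 1

def rebase_to_zero(res):
    """Keep the size of `res` but move its start to address 0."""
    return (0, size(res) - 1)

def overlaps(a, b):
    """Return True if two inclusive ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]

bar = (0xE0000000, 0xFFFFFFFF)  # the BAR that could not be reserved
pnp = (0x000F0000, 0x000F3FFF)  # pnp 00:0a range from the log

cleared = rebase_to_zero(bar)
print(hex(size(bar)))                   # 0x20000000 (512 MiB)
print(tuple(hex(x) for x in cleared))   # ('0x0', '0x1fffffff')
print(overlaps(cleared, pnp))           # True
```

A 512 MiB range placed at address 0 covers all of low memory, so it overlaps
any PNP resource below 0x20000000, which is why several PNP devices report the
same conflict.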


So the messages *look* alarming, but I don't see anything there that
should cause a spontaneous reboot.

The system stays up for 5 minutes, 10 minutes, sometimes 1-2 hours, and then
the box reboots. Even with various kernel debugging options enabled, nothing
is captured. I have not set up netconsole for this server yet, but I don't
think it would catch anything either, given how the failure occurs. It is a
brand new motherboard, memory, etc. What is interesting is that running stress
causes no issues, but I was able to make it crash by reading all of the
drives on the system and running lilo at the same time. That was the only
time I made it crash on demand (or rather "reboot", as there are no logs of
the crash).

Is this a regression? Did the system ever work reliably with any
Linux kernel? If not, I'd suspect a hardware problem like bad memory.

The memory has been tested with the latest memtest from the latest System
Rescue CD. The system has 1 stick of memory (1GB), which passed the memory
test successfully with no errors.


Bjorn

Thanks for the response..

Justin.