Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets

From: Don Dutile
Date: Sun Mar 10 2013 - 21:32:22 EST


On 03/09/2013 05:20 PM, Myron Stowe wrote:
On Sat, Mar 9, 2013 at 1:49 PM, Neil Horman<nhorman@xxxxxxxxxxxxx> wrote:
On Mon, Mar 04, 2013 at 02:04:19PM -0500, Neil Horman wrote:
A few years back Intel published a spec update:
http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf

for the 5520 and 5500 chipsets, which contained an erratum (specifically erratum
53) noting that these chipsets can't properly do interrupt remapping, and
as a result they recommend that interrupt remapping be disabled in the BIOS. While
many vendors have a BIOS update to do exactly that, not all do, and of course
not all users update their BIOS to a level that corrects the problem. As a
result, occasionally interrupts can arrive at a CPU even after affinity for that
interrupt has been moved, leading to lost or spurious interrupts, usually
characterized by the message:
kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)

There have been several incidents recently of people seeing this error, and
investigation has shown that they have systems whose BIOS level is such
that this feature was not properly turned off. As such, it would be good to
give them a reminder that their systems are vulnerable to this problem.
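
For anyone checking whether a machine has already hit this, the message
quoted above can be picked out of a saved kernel log with a few lines of
Python. This is only a rough sketch: the function name and the sample
lines are made up for illustration, and reading the "7.71" field as
"CPU number, then vector number" is an assumption about the message
format, not something stated in this thread.

```python
import re

# Assumed format, taken from the message quoted above:
#   do_IRQ: <cpu>.<vector> No irq handler for vector (irq <n>)
PATTERN = re.compile(
    r"do_IRQ: (?P<cpu>\d+)\.(?P<vector>\d+) No irq handler for vector"
)


def find_unhandled_vectors(lines):
    """Return a (cpu, vector) pair for every log line that matches."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append((int(match.group("cpu")), int(match.group("vector"))))
    return hits


# Hypothetical log excerpt: one matching line, one unrelated line.
sample = [
    "kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)",
    "kernel: some unrelated message",
]
print(find_unhandled_vectors(sample))
```

A hit only says that the symptom occurred; it does not by itself show
that the chipset erratum is the cause.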

Signed-off-by: Neil Horman<nhorman@xxxxxxxxxxxxx>
CC: Prarit Bhargava<prarit@xxxxxxxxxx>
CC: Don Zickus<dzickus@xxxxxxxxxx>
CC: Don Dutile<ddutile@xxxxxxxxxx>
CC: Bjorn Helgaas<bhelgaas@xxxxxxxxxx>
CC: Asit Mallick<asit.k.mallick@xxxxxxxxx>
CC: linux-pci@xxxxxxxxxxxxxxx

Ping, anyone want to Ack/Nack this?

Don's comment earlier seems to imply that this is a short term fix and
that a more long term fix may be coming soon. If that is the case
wouldn't we want to wait for the long term fix and just pull that in?

Myron

At the time of Neil's postings, multiple changes were being considered,
and we didn't know how long it would take to verify any one change.
Thus, Neil's patch was proposed to identify a known problem that was
being seen on multiple systems, so that further system issues wouldn't
go misdiagnosed.

We are still testing a minor change, and test results are positive so far.
Once we're sure of its logic as well as its results, it can be posted, and
this patch can be removed if it has been taken before we are certain.

As Prarit stated, we should be sure of changes in this area before
throwing this patch away for an option that could yield other failures.

- Don
Neil

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
