Re: [PATCH v2] x86/PCI: Improve log message when IRQ cannot be identified

From: Brent Spillner
Date: Wed Feb 02 2022 - 17:14:01 EST


On Wed, Feb 2, 2022 at 6:13 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> IIUC pirq_enable_irq() is only used for non-ACPI, non-DT, non-Xen,
> non-Intel MID systems, so this is a real legacy path.
>
> I don't think it's really worth cluttering an error case in a path
> that should be rarely used in the first place.

I figured if people are getting this message, then they either have
broken hardware or are debugging something, and if the message is
trying to be useful then it can't hurt to mention other things that
might help. If the pirq path has become buggy or unsupportable in
modern kernels then it probably ought to be removed altogether; if it
still works, however rarely it might be needed these days, then it's
perhaps worth mentioning to those who might occasionally have a use
for it.

> Are you seeing a problem where you're getting the wrong error message
> today? Can we just fix that problem instead so no kernel parameter is
> needed in the first place?

I was trying to isolate intermittent ACPI errors and tried booting
with acpi=noirq, as seemingly the closest modern equivalent to the
acpi=ht that had solved, or at least half-split, similar issues for me
in the past. With noirq, a different PCIe device stopped working (MPT
Fusion driver not picking up any responses to doorbell interrupts),
and while reviewing dmesg I noticed that the PCI error messages were
suggesting a kernel option that wasn't appropriate for my x86_64
architecture. In an ideal world with no hardware or driver problems
these log messages should never even happen, but in the real world of
troubleshooting and debugging I think they can be useful, and if
they're going to be reported they might as well be correct.

Obviously, this is a very minor bug, affecting only logs rather than
behavior, and I'm sure there are more pressing things to worry about.
On the other hand, it also seems like a very easy and low-risk fix
that leaves the kernel in a slightly better state for future users and
developers. At any rate, the current state of the PCI code (a)
generates a message that recommends specific kernel parameters, (b)
does so even on builds for which those parameters are inappropriate,
(c) doesn't say anything to encourage bug reports, and (d) doesn't
warn about the risks of noirq, which could cause other problems to be
misattributed. So even though the alternate messages I drafted may not
be perfect, and might need to be tuned in the future based on patterns
in whatever bug reports come in, I'm still confident that they're an
improvement (and I'm open to further suggestions).