[PATCH 0/2] PCI: Rework error reporting with PCIe failed link retraining

From: Maciej W. Rozycki
Date: Fri Feb 09 2024 - 20:44:05 EST


Hi,

This patch series addresses issues observed by Ilpo as reported here:
<https://lore.kernel.org/r/aa2d1c4e-9961-d54a-00c7-ddf8e858a9b0@xxxxxxxxxxxxxxx/>,
one with excessive delays happening when `pcie_failed_link_retrain' is
called, but link retraining has not been actually attempted, and another
one with an API misuse caused by a merge mistake.

See the individual change descriptions for further details; 1/2 supersedes:
<https://patchwork.kernel.org/project/linux-pci/patch/20240202134108.4096-1-ilpo.jarvinen@xxxxxxxxxxxxxxx/>,
and 2/2 supersedes:
<https://patchwork.kernel.org/project/linux-pci/patch/20240208132205.4550-1-ilpo.jarvinen@xxxxxxxxxxxxxxx/>.

Unfortunately I cannot verify the changes anymore beyond just checking
that the system `pcie_failed_link_retrain' was intended for still boots,
because something has happened that makes the problematic link not work at
all.

The system was up for 88 days and the link continued working as I was
logged in over a serial line wired through a PCIe serial option card
further downstream and I communicated over the line just fine to log out
in preparation for a reboot. After the reboot the link did not respond and
after several further attempts, including reboots and power cycles, the
link still does not respond; LBMS is never set and I have never observed
LT being set either. This affects U-Boot too, as previously it reported:

PCIE-0: Link up (Gen1-x8, Bus0)
PCI Autoconfig: 02.03.00: Downstream link non-functional
PCI Autoconfig: 02.03.00: Retrying with speed restricted to 2.5GT/s...
PCI Autoconfig: 02.03.00: Succeeded!

and now it only reports:

PCIE-0: Link up (Gen1-x8, Bus0)

Interestingly enough the system had its mainboard replaced those 3 months
ago to deal with an unrelated problem, and with the new mainboard in place
I already had issues with the option cards downstream from the PCIe switch
immediately wired to 02.03.0. I had to rewire and reseat the adapter and
cards several times before it started working reliably. Maybe something
has happened to the adapter board with the PCIe switch that caused it to
stop working, hopefully permanently. Perhaps it has something to do with
the power supply connection, which is via an FDC/Berg connector, not my
favourite one.

I have four such adapter boards total, so I can try and see if I am able
to revive the original one or use a replacement one, but it won't happen
right away, as I have the system installed in a remote lab ~1000mi/1600km
away from me. I'll try to bring the system back to fully working order at
the next opportunity, but it is inconvenient for me to travel there right
now just to address this problem, so it'll be a couple of weeks and likely
more before I am able to say something. I hope it's not the new mainboard
(PCIe devices in the other slots work just fine).

Hopefully I'll be able to fix it one way or another and will be able to
report on LBMS behaviour too, that is whether it retriggers with every
link training iteration or not.

Meanwhile the patches are hopefully obvious enough to apply.

Maciej