Re: [PATCH v11 1/8] PCI/ERR: Update error status after reset_link()

From: Kuppuswamy Sathyanarayanan
Date: Fri Jan 03 2020 - 20:05:29 EST



On 1/3/20 4:34 PM, Bjorn Helgaas wrote:
On Thu, Dec 26, 2019 at 04:39:07PM -0800, sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx wrote:
From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>

Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses
reset_link() to recover from fatal errors. But, if the reset is
successful there is no need to continue the rest of the error recovery
checks. Also, during fatal error recovery, if the initial value of error
status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER then
even after successful recovery (using reset_link()) pcie_do_recovery()
will report the recovery result as failure. So update the status of
error after reset_link().
I like the part about updating "status" with the result of
reset_link(), and I split that into its own patch because it
seems like a fix that *can* be separated.

But I'm not convinced that we should skip the ->slot_reset()
callbacks if the reset_link() was successful.
If reset_link() call is successful then the result value will be
"PCI_ERS_RESULT_RECOVERED". So even if you proceed with
rest of the code, slot_reset() will never get called right ?
According to
Documentation/PCI/pci-error-recovery.rst, we should call
->slot_reset() after completion of the reset.

For example, rsxx_err_handler implements ->slot_reset(), but
not ->resume(). If we reset the device, we'll claim success and
return, but we won't call rsxx_slot_reset(), which does a bunch
of important-looking recovery stuff.

If pci-error-recovery.rst is wrong, we should fix that (after
auditing all the drivers to make sure they match).

Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
Cc: Ashok Raj <ashok.raj@xxxxxxxxx>
Cc: Keith Busch <keith.busch@xxxxxxxxx>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
Acked-by: Keith Busch <keith.busch@xxxxxxxxx>
---
drivers/pci/pcie/err.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index b0e6048a9208..53cd9200ec2c 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -204,9 +204,12 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
else
pci_walk_bus(bus, report_normal_detected, &status);
- if (state == pci_channel_io_frozen &&
- reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
- goto failed;
+ if (state == pci_channel_io_frozen) {
+ status = reset_link(dev, service);
+ if (status != PCI_ERS_RESULT_RECOVERED)
+ goto failed;
+ goto done;
+ }
if (status == PCI_ERS_RESULT_CAN_RECOVER) {
status = PCI_ERS_RESULT_RECOVERED;
@@ -228,6 +231,7 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
if (status != PCI_ERS_RESULT_RECOVERED)
goto failed;
+done:
pci_dbg(dev, "broadcast resume message\n");
pci_walk_bus(bus, report_resume, &status);
--
2.21.0

--
Sathyanarayanan Kuppuswamy
Linux kernel developer