Re: [PATCH pci-next] pci/edr: Ignore Surprise Down error on hot removal

From: Ethan Zhao
Date: Mon Mar 04 2024 - 21:19:35 EST


On 3/5/2024 3:33 AM, Smita Koralahalli wrote:
Hi Ethan,

On 3/4/2024 3:58 AM, Lukas Wunner wrote:
On Mon, Mar 04, 2024 at 04:08:19AM -0500, Ethan Zhao wrote:
Per PCI firmware spec r3.3 sec 4.6.12, for firmware first mode DPC
handling path, FW should clear UC errors logged by port and bring link
out of DPC, but because of ambiguity of wording in the spec, some BIOSes
don't clear the surprise down error and the error bits in pci status,
and still notify the OS to handle it. Thus the following trick is needed
in EDR when double reporting (hot removal interrupt && dpc notification)
is hit.

Please correct me if I'm wrong.

When there is double reporting (hot removal interrupt && dpc notification),
won't the DPC handler always be called, and doesn't it already take care of
clearing the surprise down errors? Do we need to do that again from the EDR
handler?

My understanding is that if firmware first mode is enabled, the DPC driver
isn't enabled and EDR is notified instead. Some of the common functions
are used in EDR, e.g. dpc_process_error() is called in edr_handle_event(),
but dpc_handler() isn't called, and neither is dpc_handle_surprise_removal().
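
To make the two paths concrete, here is a rough user-space model of the
dispatch described above. It is only a sketch: struct mock_port and all
mock_* functions are hypothetical stand-ins rather than kernel code, and
the native path assumes dpc_handler() checks for surprise removal before
falling through to dpc_process_error().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; tracks only what the model needs. */
struct mock_port {
	bool firmware_first;   /* platform owns DPC, OS is notified via EDR */
	bool surprise_down;    /* Surprise Down error still logged by the port */
	bool surprise_handled; /* set when the surprise-removal path ran */
	bool error_processed;  /* set when the generic DPC error path ran */
};

/* Models dpc_handle_surprise_removal(): clears the logged error. */
static void mock_handle_surprise_removal(struct mock_port *p)
{
	p->surprise_down = false;
	p->surprise_handled = true;
}

/* Models dpc_process_error(): the generic error reporting path. */
static void mock_process_error(struct mock_port *p)
{
	p->error_processed = true;
}

/* Models dpc_handler(): native DPC interrupt path (assumed ordering). */
static void mock_dpc_handler(struct mock_port *p)
{
	if (p->surprise_down) {
		mock_handle_surprise_removal(p);
		return;
	}
	mock_process_error(p);
}

/* Models edr_handle_event() without the patch: no surprise-removal check. */
static void mock_edr_handle_event(struct mock_port *p)
{
	mock_process_error(p);
}

/* Picks the path the way the paragraph above describes it. */
static void mock_dispatch(struct mock_port *p)
{
	assert(p != NULL);
	if (p->firmware_first)
		mock_edr_handle_event(p);
	else
		mock_dpc_handler(p);
}
```

With firmware_first set, a port that still has surprise_down logged goes
through mock_process_error() and the error is never cleared, which is the
case the patch is trying to cover.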

Thanks,
Ethan


Thanks
Smita


Please provide more detailed information about the hardware and BIOS
affected by this.


-static void dpc_handle_surprise_removal(struct pci_dev *pdev)
+bool dpc_handle_surprise_removal(struct pci_dev *pdev)
  {
+    if (!dpc_is_surprise_removal(pdev))
+        return false;

This change of moving dpc_is_surprise_removal() into
dpc_handle_surprise_removal() seems unrelated to the problem at hand.

Please drop it if it's unnecessary to fix the issue.


--- a/drivers/pci/pcie/edr.c
+++ b/drivers/pci/pcie/edr.c
@@ -184,6 +184,9 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
          goto send_ost;
      }
+    if (dpc_handle_surprise_removal(edev))
+        goto send_ost;
+
      dpc_process_error(edev);
      pci_aer_raw_clear_status(edev);

This seems to be the only necessary change.  Please reduce the
patch to contain only it and no other refactoring.
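
For illustration, the control flow of that hunk can be modelled in user
space as below. This is a sketch under assumptions: struct sketch_port and
the sketch_* functions are hypothetical, the bool return value mirrors the
signature proposed in the patch, and the counter at the send_ost label only
stands in for whatever status is reported back to firmware there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the port device; not a kernel structure. */
struct sketch_port {
	bool surprise_down;   /* Surprise Down error logged by the port */
	int errors_processed; /* times the generic error path ran */
	int ost_sent;         /* times the send_ost label was reached */
};

/*
 * Models the bool-returning dpc_handle_surprise_removal() from the patch:
 * returns true only if a surprise removal was detected and handled.
 */
static bool sketch_handle_surprise_removal(struct sketch_port *p)
{
	if (!p->surprise_down)
		return false;
	p->surprise_down = false;
	return true;
}

/* Models the patched part of edr_handle_event(). */
static void sketch_edr_handle_event(struct sketch_port *p)
{
	assert(p != NULL);

	/* Early exit: skip generic error processing on surprise removal. */
	if (sketch_handle_surprise_removal(p))
		goto send_ost;

	p->errors_processed++; /* stands in for dpc_process_error() etc. */

send_ost:
	p->ost_sent++;
}
```

In this model a surprise removal reaches send_ost without touching the
generic error path, while any other error is processed as before.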

Please capitalize the "PCI/EDR: " prefix in the subject and add
a Fixes tag.

Thanks,

Lukas