Re: [PATCH v4] PCI: pciehp: Fix MSI interrupt race

From: Bjorn Helgaas
Date: Sat Mar 28 2020 - 17:27:29 EST


On Wed, Feb 19, 2020 at 03:31:13PM +0100, Lukas Wunner wrote:
> From: Stuart Hayes <stuart.w.hayes@xxxxxxxxx>
>
> Without this commit, a PCIe hotplug port can stop generating interrupts
> on hotplug events, so device adds and removals will not be seen:
>
> The pciehp interrupt handler pciehp_isr() reads the Slot Status register
> and then writes back to it to clear the bits that caused the interrupt.
> If a different interrupt event bit gets set between the read and the
> write, pciehp_isr() returns without having cleared all of the interrupt
> event bits. If this happens when the MSI isn't masked (which by default
> it isn't in handle_edge_irq(), and which it will never be when MSI
> per-vector masking is not supported), we won't get any more hotplug
> interrupts from that device.
>
> That is expected behavior, according to the PCIe Base Spec r5.0, section
> 6.7.3.4, "Software Notification of Hot-Plug Events".
>
> Because the Presence Detect Changed and Data Link Layer State Changed
> event bits can both get set at nearly the same time when a device is
> added or removed, this is more likely to happen than it might seem.
> The issue was found (and can be reproduced rather easily) by connecting
> and disconnecting an NVMe storage device on at least one system model
> where the NVMe devices were being connected to an AMD PCIe port (PCI
> device 0x1022/0x1483).
>
> Fix the issue by modifying pciehp_isr() to loop back and re-read the
> Slot Status register immediately after writing to it, until it sees that
> all of the event status bits have been cleared.
>
> Signed-off-by: Stuart Hayes <stuart.w.hayes@xxxxxxxxx>
> [lukas: drop loop count limitation, write "events" instead of "status",
> don't loop back in INTx and poll modes, tweak code comment & commit msg]
> Signed-off-by: Lukas Wunner <lukas@xxxxxxxxx>

Applied to pci/hotplug for v5.7, thanks!

> ---
> v4 (lukas):
> * drop "MAX_ISR_STATUS_READS" loop count limitation
> * drop unnecessary braces around PCI_EXP_SLTSTA_* flags
> * write "events" instead of "status" variable to Slot Status register
> to avoid unnecessary loop iterations if the same bit gets set
> repeatedly
> * don't loop back in INTx and poll modes
> * shorten and tweak code comment & commit message
>
> v3:
> * removed pvm_capable flag (from v2) since MSI may not be masked
> regardless of whether per-vector masking is supported
> * tweaked comments
>
> v2:
> * fixed ctrl_warn() call
> * improved comments
> * added pvm_capable flag and changed pciehp_isr() to loop back only when
> pvm_capable flag not set (suggested by Lukas Wunner)
>
> drivers/pci/hotplug/pciehp_hpc.c | 26 ++++++++++++++++++++------
> 1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 8a2cb1764386..f64d10df9eb5 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -527,7 +527,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
> struct controller *ctrl = (struct controller *)dev_id;
> struct pci_dev *pdev = ctrl_dev(ctrl);
> struct device *parent = pdev->dev.parent;
> - u16 status, events;
> + u16 status, events = 0;
>
> /*
> * Interrupts only occur in D3hot or shallower and only if enabled
> @@ -552,6 +552,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
> }
> }
>
> +read_status:
> pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status);
> if (status == (u16) ~0) {
> ctrl_info(ctrl, "%s: no response from device\n", __func__);
> @@ -564,24 +565,37 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
> * Slot Status contains plain status bits as well as event
> * notification bits; right now we only want the event bits.
> */
> - events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
> - PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
> - PCI_EXP_SLTSTA_DLLSC);
> + status &= PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
> + PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
> + PCI_EXP_SLTSTA_DLLSC;
>
> /*
> * If we've already reported a power fault, don't report it again
> * until we've done something to handle it.
> */
> if (ctrl->power_fault_detected)
> - events &= ~PCI_EXP_SLTSTA_PFD;
> + status &= ~PCI_EXP_SLTSTA_PFD;
>
> + events |= status;
> if (!events) {
> if (parent)
> pm_runtime_put(parent);
> return IRQ_NONE;
> }
>
> - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
> + if (status) {
> + pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
> +
> + /*
> + * In MSI mode, all event bits must be zero before the port
> + * will send a new interrupt (PCIe Base Spec r5.0 sec 6.7.3.4).
> + * So re-read the Slot Status register in case a bit was set
> + * between read and write.
> + */
> + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode)
> + goto read_status;
> + }
> +
> ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events);
> if (parent)
> pm_runtime_put(parent);
> --
> 2.24.0
>