RE: HW power fault defect cause system hang on kernel 5.4.y

From: Bao, Joseph
Date: Tue Nov 09 2021 - 03:00:22 EST


Hi Lukas/Stuart,
Want to follow up with you whether the system hang is expected when HW has a defect keeping PCI_EXP_SLTSTA_PFD always HIGH.


Regards
Joseph

-----Original Message-----
From: Bjorn Helgaas <helgaas@xxxxxxxxxx>
Sent: Wednesday, November 3, 2021 6:34 AM
To: Bao, Joseph <joseph.bao@xxxxxxxxx>
Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>; linux-pci@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Stuart Hayes <stuart.w.hayes@xxxxxxxxx>; Lukas Wunner <lukas@xxxxxxxxx>
Subject: Re: HW power fault defect cause system hang on kernel 5.4.y

[+cc Stuart, author of 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race"), Lukas, pciehp expert]

On Tue, Nov 02, 2021 at 03:45:00AM +0000, Bao, Joseph wrote:
> Hi, dear kernel developer,
>
> Recently we encounter system hang (dead spinlock) when move to kernel
> linux-5.4.y.
>
> Finally, we use bisect to locate the suspicious commit
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df.

4667358dab9c backported upstream commit 8edf5332c393 ("PCI: pciehp:
Fix MSI interrupt race") to v5.4.69 just over a year ago.

> Our system has some HW defect, which will wrongly set
> PCI_EXP_SLTSTA_PFD high, and this commit will lead to infinite loop
> jumping to read_status (no chance to clear status PCI_EXP_SLTSTA_PFD
> bit since ctrl is not updated), I know this is our HW defect, but this
> commit makes kernel trapped in this isr function and leads to kernel
> hang (then the user could not get useful information to show what's
> wrong), which I think is not expected behavior, so I would like to
> report to you for discussion.

I guess this happens because the first time we handle PFD,
pciehp_ist() sets "ctrl->power_fault_detected = 1", and when power_fault_detected is set, pciehp_isr() won't clear PFD from PCI_EXP_SLTSTA?

It looks like the only place we clear power_fault_detected is in pciehp_power_on_slot(), and I don't think we call that unless we have a presence detect or link status change.

It would definitely be nice if we could arrange so this hardware defect didn't cause a kernel hang.

I think the diff below is the backport of 8edf5332c393 ("PCI: pciehp:
Fix MSI interrupt race").

> diff --git a/drivers/pci/hotplug/pciehp_hpc.c
> b/drivers/pci/hotplug/pciehp_hpc.c
> index 356786a3b7f4b..88b996764ff95 100644
> ---
> a/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tre
> e/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=ca767cf0152d18fc29
> 9cde85b18d1f46ac21e1ba
> +++ b/https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
> +++ /tree/drivers/pci/hotplug/pciehp_hpc.c?h=linux-5.4.y&id=4667358dab
> +++ 9cc07da044d5bc087065545b1000df
> @@ -529,7 +529,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
> struct controller *ctrl = (struct controller *)dev_id;
> struct pci_dev *pdev = ctrl_dev(ctrl);
> struct device *parent = pdev->dev.parent;
> - u16 status, events;
> + u16 status, events = 0;
>
> /*
> * Interrupts only occur in D3hot or shallower and only if enabled
> @@ -554,6 +554,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
> }
> }
>
> +read_status:
> pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status);
> if (status == (u16) ~0) {
> ctrl_info(ctrl, "%s: no response from device\n", __func__); @@
> -566,24 +567,37 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
> * Slot Status contains plain status bits as well as event
> * notification bits; right now we only want the event bits.
> */
> - events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
> - PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
> - PCI_EXP_SLTSTA_DLLSC);
> + status &= PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
> + PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
> + PCI_EXP_SLTSTA_DLLSC;
>
> /*
> * If we've already reported a power fault, don't report it again
> * until we've done something to handle it.
> */
> if (ctrl->power_fault_detected)
> - events &= ~PCI_EXP_SLTSTA_PFD;
> + status &= ~PCI_EXP_SLTSTA_PFD;
>
> + events |= status;
> if (!events) {
> if (parent)
> pm_runtime_put(parent);
> return IRQ_NONE;
> }
>
> - pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
> + if (status) {
> + pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
> +
> + /*
> + * In MSI mode, all event bits must be zero before the port
> + * will send a new interrupt (PCIe Base Spec r5.0 sec 6.7.3.4).
> + * So re-read the Slot Status register in case a bit was set
> + * between read and write.
> + */
> + if (pci_dev_msi_enabled(pdev) && !pciehp_poll_mode)
> + goto read_status;
> + }
> +
> ctrl_dbg(ctrl, "pending interrupts %#06x from Slot Status\n", events);
> if (parent)
> pm_runtime_put(parent);