Re: [PATCH v2 1/2] PCI/AER: Disable AER service when link is in L2/L3 ready, L2 and L3 state

From: Sathyanarayanan Kuppuswamy
Date: Sun Mar 20 2022 - 23:52:43 EST




On 3/20/22 7:38 PM, Kai-Heng Feng wrote:
On Sun, Mar 20, 2022 at 4:38 AM Sathyanarayanan Kuppuswamy
<sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx> wrote:



On 1/26/22 6:54 PM, Kai-Heng Feng wrote:
Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
hint") enables ACS, and some platforms lose their NVMe after resume from

Why does enabling ACS make the platform lose NVMe? Can you add more details
about the problem?

I don't have a hardware analyzer, so the only detail I can provide is
the symptom.
I believe the affected system was sent to Intel, and there hasn't been any
feedback since then.

Since your commit log refers to ACS, I think we first need to understand
the following points.

1. Why do we get ACSViol during S3 resume? Is this just noise?
2. Why does AER recovery fail?
3. Is this common to all platforms, or does it only happen on your test
platform?

If you are not clear about the above points, I think you can submit this
patch as adding suspend/resume support to the AER/DPC driver and leave out
the ACS issue.

From your commit log, the problem is not very clear.



S3:
[ 50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
[ 50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
[ 50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 50.947830] pcieport 0000:00:1b.0: device [8086:06ac] error status/mask=00200000/00010000
[ 50.947831] pcieport 0000:00:1b.0: [21] ACSViol (First)
[ 50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
[ 50.947843] nvme nvme0: frozen state error detected, reset controller

It happens right after ACS gets enabled during resume.
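
For readers following the log above, here is a minimal userspace sketch of how
the "error status/mask" pair maps to the short labels in the AER output. The
table follows the AER Uncorrectable Error Status bit layout for the bits listed
in it. The helper names (aer_uncor_bit_name, aer_lowest_unmasked_bit) are
hypothetical and are not kernel functions. The "(First)" tag in the log comes
from the First Error Pointer, which this sketch does not model.

```c
#include <stdint.h>

/* Short labels for AER Uncorrectable Error Status bits.
 * Bits without an entry here are reported as "Unknown" by this sketch. */
static const char *const uncor_names[32] = {
    [4]  = "DLP",
    [5]  = "SDES",
    [12] = "TLP",
    [13] = "FCP",
    [14] = "CmpltTO",
    [15] = "CmpltAbrt",
    [16] = "UnxCmplt",
    [17] = "RxOF",
    [18] = "MalfTLP",
    [19] = "ECRC",
    [20] = "UnsupReq",
    [21] = "ACSViol",
};

/* Hypothetical helper: label for one status bit position. */
const char *aer_uncor_bit_name(unsigned int bit)
{
    if (bit >= 32 || !uncor_names[bit])
        return "Unknown";
    return uncor_names[bit];
}

/* Hypothetical helper: lowest bit set in status and clear in mask,
 * or -1 if every reported error is masked. */
int aer_lowest_unmasked_bit(uint32_t status, uint32_t mask)
{
    uint32_t active = status & ~mask;

    for (int bit = 0; bit < 32; bit++)
        if (active & (UINT32_C(1) << bit))
            return bit;
    return -1;
}
```

With status/mask=00200000/00010000 from the S3 log, the only unmasked bit is 21,
which is the ACSViol entry. The mask value 00010000 is bit 16 (UnxCmplt).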

There's another case, when Thunderbolt reaches D3cold:
[ 30.100211] pcieport 0000:00:1d.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1d.0
[ 30.100251] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 30.100256] pcieport 0000:00:1d.0: device [8086:7ab0] error status/mask=00100000/00004000
[ 30.100262] pcieport 0000:00:1d.0: [20] UnsupReq (First)
[ 30.100267] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 08000052 00000000 00000000
[ 30.100372] thunderbolt 0000:0a:00.0: AER: can't recover (no error_detected callback)

The "no callback" message means one or more devices under the given port do
not support an error handler. How is this related to ACS?

This case is about D3cold, not related to ACS.
And no error_detected is just part of the message. The whole AER
message is more important.

Kai-Heng


[ 30.100401] xhci_hcd 0000:3e:00.0: AER: can't recover (no error_detected callback)
[ 30.100427] pcieport 0000:00:1d.0: AER: device recovery failed

So disable AER service to avoid the noise from turning power rails
on/off when the device is in low power states (D3hot and D3cold), as
PCIe spec "5.2 Link State Power Management" states that TLP and DLLP
transmission is disabled for a Link in L2/L3 Ready (D3hot), L2 (D3cold
with aux power) and L3 (D3cold).
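
The reasoning in the paragraph above can be sketched as a small lookup. This is
only an illustration of the state mapping quoted from the commit message, not
code from the patch. The enum and function names are hypothetical, and link
states the commit message does not mention (such as L0s and L1) are left out.

```c
#include <stdbool.h>

/* Hypothetical enum covering only the link states named in the
 * commit message, plus the fully active L0 state. */
enum pcie_link_state {
    LINK_L0,           /* link fully active */
    LINK_L2_L3_READY,  /* device in D3hot */
    LINK_L2,           /* device in D3cold, aux power present */
    LINK_L3,           /* device in D3cold, no power */
};

/* Device power state paired with each link state, as listed above. */
const char *link_device_state(enum pcie_link_state s)
{
    switch (s) {
    case LINK_L0:          return "D0";
    case LINK_L2_L3_READY: return "D3hot";
    case LINK_L2:          return "D3cold (aux power)";
    case LINK_L3:          return "D3cold";
    }
    return "unknown";
}

/* TLP and DLLP transmission is disabled in L2/L3 Ready, L2 and L3,
 * so among the states modeled here only L0 carries traffic. */
bool link_carries_tlp_dllp(enum pcie_link_state s)
{
    return s == LINK_L0;
}

/* The patch's argument: errors raised while the link carries no
 * traffic are noise, so AER reporting is only useful otherwise. */
bool aer_reporting_useful(enum pcie_link_state s)
{
    return link_carries_tlp_dllp(s);
}
```

Under these assumptions, aer_reporting_useful() is false for all three low
power link states, which is the condition the suspend callbacks in the patch
act on.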

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
---
v2:
- Wording change.

drivers/pci/pcie/aer.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b270..e4e9d4a3098d7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1367,6 +1367,22 @@ static int aer_probe(struct pcie_device *dev)
	return 0;
}

+static int aer_suspend(struct pcie_device *dev)
+{
+	struct aer_rpc *rpc = get_service_data(dev);
+
+	aer_disable_rootport(rpc);
+	return 0;
+}
+
+static int aer_resume(struct pcie_device *dev)
+{
+	struct aer_rpc *rpc = get_service_data(dev);
+
+	aer_enable_rootport(rpc);
+	return 0;
+}
+
+
/**
* aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
* @dev: pointer to Root Port, RCEC, or RCiEP
@@ -1433,12 +1449,15 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
}

static struct pcie_port_service_driver aerdriver = {
-	.name		= "aer",
-	.port_type	= PCIE_ANY_PORT,
-	.service	= PCIE_PORT_SERVICE_AER,
-
-	.probe		= aer_probe,
-	.remove		= aer_remove,
+	.name			= "aer",
+	.port_type		= PCIE_ANY_PORT,
+	.service		= PCIE_PORT_SERVICE_AER,
+	.probe			= aer_probe,
+	.suspend		= aer_suspend,
+	.resume			= aer_resume,
+	.runtime_suspend	= aer_suspend,
+	.runtime_resume		= aer_resume,
+	.remove			= aer_remove,
};

/**

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
