Re: [PATCH 1/2] PCI/AER: Disable AER interrupt during suspend

From: Kai-Heng Feng
Date: Fri Jul 23 2021 - 03:05:43 EST


On Fri, Jul 23, 2021 at 1:24 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
>
> On Thu, Jul 22, 2021 at 05:23:51PM -0500, Bjorn Helgaas wrote:
> > Marking both of these as "not applicable" for now because I don't
> > think we really understand what's going on.
> >
> > Apparently a DMA occurs during suspend or resume and triggers an ACS
> > violation. I don't think think such a DMA should occur in the first
> > place.
> >
> > Or maybe, since you say the problem happens right after ACS is enabled
> > during resume, we're doing the ACS enable incorrectly? Although I
> > would think we should not be doing DMA at the same time we're enabling
> > ACS, either.
> >
> > If this really is a system firmware issue, both HP and Dell should
> > have the knowledge and equipment to figure out what's going on.
>
> DMA on resume sounds really odd. OTOH the below mentioned case of
> a DMA during suspend seems very like in some setup. NVMe has the
> concept of a host memory buffer (HMB) that allows the PCIe device
> to use arbitrary host memory for internal purposes. Combine this
> with the "Storage D3" misfeature in modern x86 platforms that force
> a slot into d3cold without consulting the driver first and you'd see
> symptoms like this. Another case would be the NVMe equivalent of the
> AER which could lead to a completion without host activity.

The issue can also be observed on non-HMB NVMe.

>
> We now have quirks in the ACPI layer and NVMe to fully shut down the
> NVMe controllers on these messed up systems with the "Storage D3"
> misfeature which should avoid such "spurious" DMAs at the cost of
> wearning out the device much faster.

Since the issue is on S3, I think the NVMe always fully shuts down.

Kai-Heng