Re: [PATCH 1/3] PCI: Add helper to check if any of ancestor device support D3cold

From: Kai-Heng Feng
Date: Mon Aug 28 2023 - 03:30:10 EST


On Sat, Aug 26, 2023 at 9:11 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Fri, Aug 25, 2023 at 09:39:48AM +0300, Mika Westerberg wrote:
> > On Fri, Aug 25, 2023 at 01:43:08PM +0800, Kai-Heng Feng wrote:
> > > On Fri, Aug 25, 2023 at 1:29 PM Mika Westerberg
> > > <mika.westerberg@xxxxxxxxxxxxxxx> wrote:
> > > > On Thu, Aug 24, 2023 at 09:46:00PM +0800, Kai-Heng Feng wrote:
> > > > > On Thu, Aug 24, 2023 at 7:57 PM Mika Westerberg
> > > > > <mika.westerberg@xxxxxxxxxxxxxxx> wrote:
>
> > > I think what Bjorn suggested is to keep AER enabled for D3hot, and
> > > only disable it for D3cold and S3.
> > >
> > > > > Unless there are cases where device firmware behaves differently in
> > > > > D3hot? Then maybe it's better to disable AER for D3hot, D3cold,
> > > > > and system S3.
> > > >
> > > > Yes, this makes sense.
> > >
> > > I agree that differentiating between D3hot and D3cold unnecessarily makes
> > > things more complicated, but Bjorn suggested errors reported by AER
> > > under D3hot should still be recorded.
> > > Do you have more compelling data to persuade Bjorn that AER should be
> > > disabled for both D3 states?
> >
> > Is there even an AER error that can happen when a device is in D3hot
> > (link is in L1) or D3cold (link is in L2/3)? I'm not an expert in AER
> > but AFAICT these errors are reported when the device is in active state
> > not when it is in low power state.
>
> I don't think a device in D3cold can signal its own errors. But the
> link transition to L2/L3 as a device goes to D3cold may cause the
> bridge above to log an error. And of course a config access to a
> device in D3cold will probably result in an Unsupported Request being
> logged by the bridge above it. I think these are the sorts of errors
> we do need to avoid or ignore somehow.

In addition to that, we can't really control how the device behaves
during the D3hot (L2) transition.
The kernel can't control how the firmware on the device responds.

>
> But Configuration and Message requests definitely happen in D3hot, and
> they can cause errors reported via AER. The spec (r6.0, sec 2.2.8)
> recommends that Messages be handled the same in D0-D3hot.
>
> PTM is an example of where we had errors being reported at suspend/
> resume because we had it configured incorrectly. If we disabled AER
> in D3hot we might not learn about that kind of configuration problem.
> That's what makes me think there's some value in keeping AER enabled
> in D3hot.

In this particular case, the firmware of the device gets power cycled
and starts sending PTM because of that.
For this case, we want to know that the error happened, but in the
meantime there's not much that can be done.

So is it reasonable to log Corrected Errors but skip the AER reset?
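
To make the question concrete, here is a rough, standalone sketch of one
possible reading of such a policy: always log the error, and only attempt
recovery for uncorrectable errors while the device is in D0. This is not
kernel code. The enums, the struct, and aer_policy() are all hypothetical
names made up for illustration, and the exact rule is an assumption rather
than anything agreed on in this thread.

```c
#include <stdbool.h>

/* Hypothetical types for illustration only; not existing kernel APIs. */
enum aer_sev {
	SEV_CORRECTED,
	SEV_NONFATAL,
	SEV_FATAL,
};

enum dev_pm_state {
	PM_D0,
	PM_D3HOT,
	PM_D3COLD,
	PM_SYS_S3,
};

struct aer_action {
	bool log;	/* record the error so it stays visible */
	bool reset;	/* attempt AER recovery/reset */
};

/*
 * Assumed policy: errors are always logged. Recovery is attempted only
 * for uncorrectable errors, and only while the device is in D0, since
 * little can be done while it is in a low-power state.
 */
static struct aer_action aer_policy(enum dev_pm_state state, enum aer_sev sev)
{
	struct aer_action act = {
		.log = true,
		.reset = false,
	};

	if (state == PM_D0 && sev != SEV_CORRECTED)
		act.reset = true;

	return act;
}
```

With this sketch, a Corrected Error seen in D3hot would still be logged
(so a PTM-style misconfiguration stays visible), but no reset would be
attempted.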

Kai-Heng

>
> Bjorn