Re: [Intel-wired-lan] [PATCH] igc: Ignore AER reset when device is suspended

From: Kai-Heng Feng
Date: Tue Jun 27 2023 - 04:13:15 EST


On Thu, Jun 22, 2023 at 9:11 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Thu, Jun 22, 2023 at 08:09:34AM +0300, Neftin, Sasha wrote:
> > On 6/21/2023 23:43, Bjorn Helgaas wrote:
> > > On Tue, Jun 20, 2023 at 08:36:36PM +0800, Kai-Heng Feng wrote:
> > > > When a system that connects to a Thunderbolt dock equipped with I225,
> > > > I225 stops working after S3 resume:
>
> > > > The issue is that the PTM requests are sending before driver resumes the
> > > > device. Since the issue can also be observed on Windows, it's quite
> > > > likely a firmware/hardwar limitation.
> > >
> > > I thought c01163dbd1b8 ("PCI/PM: Always disable PTM for all devices
> > > during suspend") would turn off PTM. Is that not working for this
> > > path, or are we re-enabling PTM incorrectly, or something else?
> >
> > I think we hit on the HW bug here. On some i225/6 parts, PTM requests are
> > sent before SW takes ownership of the device. This patch could help.
>
> Is there an erratum we can read? If this is needed to work around a
> hardware defect, there should be a comment in the code to that effect,
> and we should have a better understanding because there may be other
> scenarios (suspend/resume, hotplug, etc) that need similar changes.

Actually, similar message can be seen on hotplugging the device. The
AER message will be gone shortly after the driver done it's probing.

>
> (I know this patch is to work around a suspend/resume issue, but the
> change is in the AER error recovery path, so it doesn't quite fit
> together for me yet.)

This is something I really want to discuss.
This is not the first time that AER handling doesn't play well with
system resume because the error handling and resume routine can happen
at the same time. Some possible way going forward:
1) Serialize error recovery and resume routine.
- If error recovery happens first and it's a successful recovery,
does the resume callback still need to be called?
- If the device successfully resume, is the error recovery routine
still needed?
So I think the most plausible way is to call error recovery only if
the resume fails. Ignore the AER if resume success.

2) Disable the AER interrupt during suspend
- Since the AER is still recorded and AER interrupt gets enabled by
port driver before child device resuming, the error recovery/resume
race can still happen.
- So the port services resume routines can only be called after the
entire PCIe hierarchy is resumed.

3) Disable the AER service completely during suspend
- This is what's in my mind. If the AER is caused by firmware and
hardware (like most cases), the most feasible way is to workaround the
issue in the driver.

IMO ether 1) or 2) requires involvements that add little benefit. So
hopefully we can opt to 3).

>
> Are you saying the NIC sends PTM requests when it doesn't have PTM
> Enable set?

I think I mentioned during previous discussion. The PTM gets enabled
by the firmware/hardware on the TBT dock right on S3 resume.
The issue is also logged on Windows' Event Viewer, but hardware vendor
doesn't care at all since the device is still functional :)

>
> What exactly does it mean for "SW to take ownership of the device"?
> What PCIe transaction would tell the device the SW has taken
> ownership?

Please correct me if I am wrong, but Intel ethernet devices may need
the driver to perform some actions so the ownership can be switched
between software and firmware.

Kai-Heng

>
> So far this feels kind of hand-wavey.
>
> > > Checking pci_is_enable() in the .error_detected() callback looks like
> > > a pattern that may need to be replicated in many other drivers, which
> > > makes me think it may not be the best approach.
> > >
> > > > So avoid resetting the device if it's not resumed. Once the device is
> > > > fully resumed, the device can work normally.
> > > >
> > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=216850
> > > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
> > > > ---
> > > > drivers/net/ethernet/intel/igc/igc_main.c | 3 +++
> > > > 1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> > > > index fa764190f270..6a46f886ff43 100644
> > > > --- a/drivers/net/ethernet/intel/igc/igc_main.c
> > > > +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> > > > @@ -6962,6 +6962,9 @@ static pci_ers_result_t igc_io_error_detected(struct pci_dev *pdev,
> > > > struct net_device *netdev = pci_get_drvdata(pdev);
> > > > struct igc_adapter *adapter = netdev_priv(netdev);
> > > > + if (!pci_is_enabled(pdev))
> > > > + return 0;
> > > > +
> > > > netif_device_detach(netdev);
> > > > if (state == pci_channel_io_perm_failure)