Re: [BUG] net, pci: 6.3-rc1-4 hangs during boot on PowerEdge R620 with igb

From: Rafael J. Wysocki
Date: Thu Apr 20 2023 - 11:32:58 EST


On Wed, Apr 19, 2023 at 9:34 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Wed, Apr 12, 2023 at 04:20:33PM +0300, Andy Shevchenko wrote:
> > On Tue, Apr 11, 2023 at 02:02:03PM -0500, Rob Herring wrote:
> > > On Tue, Apr 11, 2023 at 7:53 AM Donald Hunter <donald.hunter@xxxxxxxxx> wrote:
> > > > Bjorn Helgaas <helgaas@xxxxxxxxxx> writes:
> > > > > On Mon, Apr 10, 2023 at 04:10:54PM +0100, Donald Hunter wrote:
> > > > >> On Sun, 2 Apr 2023 at 23:55, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > > >> > On Sat, Apr 01, 2023 at 01:52:25PM +0100, Donald Hunter wrote:
> > > > >> > > On Fri, 31 Mar 2023 at 20:42, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > > >> > > >
> > > > >> > > > I assume this igb NIC (07:00.0) must be built-in (not a plug-in card)
> > > > >> > > > because it apparently has an ACPI firmware node, and there's something
> > > > >> > > > we don't expect about its status?
> > > > >> > >
> > > > >> > > Yes they are built-in, to my knowledge.
> > > > >> > >
> > > > >> > > > Hopefully Rob will look at this. If I were looking, I would be
> > > > >> > > > interested in acpidump to see what's in the DSDT.
> > > > >> > >
> > > > >> > > I can get an acpidump. Is there a preferred way to share the files, or just
> > > > >> > > an email attachment?
> > > > >> >
> > > > >> > I think by default acpidump produces ASCII that can be directly
> > > > >> > included in email. http://vger.kernel.org/majordomo-info.html says
> > > > >> > 100K is the limit for vger mailing lists. Or you could open a report
> > > > >> > at https://bugzilla.kernel.org and attach it there, maybe along with a
> > > > >> > complete dmesg log and "sudo lspci -vv" output.
> > > > >>
> > > > >> Apologies for the delay, I was unable to access the machine while travelling.
> > > > >>
> > > > >> https://bugzilla.kernel.org/show_bug.cgi?id=217317
> > > > >
> > > > > Thanks for that! Can you boot a kernel with 6fffbc7ae137 reverted
> > > > > with this in the kernel parameters:
> > > > >
> > > > > dyndbg="file drivers/acpi/* +p"
> > > > >
> > > > > and collect the entire dmesg log?
> > > >
> > > > Added to the bugzilla report.
> > >
> > > Rafael, Andy, any ideas why fwnode_device_is_available() would return
> > > false for a built-in PCI device with an ACPI device entry? The only
> > > thing I see in the log is that the parent PCI bridge/bus doesn't seem
> > > to have an ACPI device entry (based on "[ 0.913389] pci_bus
> > > 0000:07: No ACPI support"). For DT, if the parent doesn't have a node,
> > > then the child can't. Not sure on ACPI.
> >
> > Thanks for the Cc. I haven't checked anything yet, but from the above it
> > sounds like a BIOS issue. If PCI has no ACPI companion tree, then why the heck
> > does one of the devices have an entry? I'm not even sure this is allowed by the
> > ACPI specification, but as I said, I am going solely by the above mail.
>
> ACPI r6.5, sec 6.3.7, about _STA says:
>
>   - Bit [0] - Set if the device is present.
>   - Bit [1] - Set if the device is enabled and decoding its resources.
>   - Bit [3] - Set if the device is functioning properly (cleared if
>     device failed its diagnostics).
>
> ...
>
> If a device is present on an enumerable bus, then _STA must not
> return 0. In that case, bit[0] must be set and if the status of the
> device can be determined through a bus-specific enumeration and
> discovery mechanism, it must be reflected by the values of bit[1]
> and bit[3], even though the OSPM is not required to take them into
> account.
>
> Since PCI *is* an enumerable bus, I don't think we can use _STA to
> decide whether a PCI device is present.

You are right, _STA can't be used for that.
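
To make the quoted rule concrete, here is a minimal sketch of the three
_STA bits listed above and of a presence check that ignores _STA on an
enumerable bus. All names (STA_PRESENT, sta_present(), device_present()
and its parameters) are made up for illustration; this is not the
kernel's code and not a proposed fix.

```c
#include <stdbool.h>
#include <stdint.h>

/* _STA bits as quoted from ACPI r6.5, sec 6.3.7 (macro names are hypothetical). */
#define STA_PRESENT     (1u << 0)  /* Bit [0]: device is present */
#define STA_ENABLED     (1u << 1)  /* Bit [1]: enabled and decoding its resources */
#define STA_FUNCTIONING (1u << 3)  /* Bit [3]: functioning properly */

static bool sta_present(uint32_t sta)     { return (sta & STA_PRESENT) != 0; }
static bool sta_enabled(uint32_t sta)     { return (sta & STA_ENABLED) != 0; }
static bool sta_functioning(uint32_t sta) { return (sta & STA_FUNCTIONING) != 0; }

/*
 * Hypothetical helper: on an enumerable bus such as PCI, presence comes
 * from bus enumeration and _STA is not consulted at all. Only for a
 * device on a non-enumerable bus does the _STA present bit decide.
 */
static bool device_present(bool on_enumerable_bus,
                           bool found_by_bus_enumeration,
                           uint32_t sta)
{
	if (on_enumerable_bus)
		return found_by_bus_enumeration;
	return sta_present(sta);
}
```

With this helper, a PCI function that was found by enumeration counts as
present even if its _STA evaluates to 0, which is the behavior the spec
text above asks for.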

> We can use _STA to decide whether a host bridge is present, of course,
> but that doesn't help here because the host bridge in question is
> PNP0A08:00 that leads to [bus 00-3d], and it is present.
>
> I don't know exactly what path led to the igb issue, but I don't think
> we need to figure that out. I think we just need to avoid the use of
> _STA in fwnode_device_is_available().

I agree. Using _STA there is incorrect.

> 6fffbc7ae137 ("PCI: Honor firmware's device disabled status") appeared
> in v6.3-rc1, so I think we need to revert or fix it before v6.3, which
> will probably be tagged Sunday (and I'll be on vacation
> Friday-Monday).

Yes, please revert this one ASAP.

Cheers,
Rafael