Re: [PATCH] thunderbolt: Resume PCIe bridges after switch is found on AMD USB4 controller

From: Kai-Heng Feng
Date: Thu Sep 08 2022 - 10:03:00 EST


"

On Thu, Sep 8, 2022 at 12:30 AM Limonciello, Mario
<Mario.Limonciello@xxxxxxx> wrote:
>
> [Public]
>
> Hi,
>
> > -----Original Message-----
> > From: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>
> > Sent: Monday, September 5, 2022 02:30
> > To: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
> > Cc: mika.westerberg@xxxxxxxxxxxxxxx; andreas.noever@xxxxxxxxx;
> > michael.jamet@xxxxxxxxx; YehezkelShB@xxxxxxxxx; Mehta, Sanju
> > <Sanju.Mehta@xxxxxxx>; Limonciello, Mario
> > <Mario.Limonciello@xxxxxxx>; linux-usb@xxxxxxxxxxxxxxx; linux-
> > kernel@xxxxxxxxxxxxxxx
> > Subject: Re: [PATCH] thunderbolt: Resume PCIe bridges after switch is found
> > on AMD USB4 controller
> >
> > On Mon, Sep 05, 2022 at 02:56:22PM +0800, Kai-Heng Feng wrote:
> > > AMD USB4 can not detect external PCIe devices like external NVMe when
> > > it's hotplugged, because card/link are not up:
> > >
> > > pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101
> >
> > That sounds like a hardware bug, how does this work in other operating
> > systems for this hardware?
>
> We happen to have this HP system in our lab. My colleague Anson (now on CC) flashed
> the same BIOS to it (01.02.01) using dediprog and loaded a 6.0-rc3 mainline kernel built
> from the Canonical mainline kernel PPA.
>
> He then tried to hotplug a TBT3 SSD a number of times but couldn't hit this issue.
> I attached his log to the kernel Bugzilla.

Nice to hear. Hopefully this can be fixed at firmware/hardware side.

>
> >
> > > Use `lspci` to resume pciehp bridges can find external devices.
> >
> > That's not good :(
> >
> > > A long delay before checking card/link presence doesn't help, either.
> > > The only way to make the hotplug work is to enable pciehp interrupt and
> > > check card presence after the TB switch is added.
> > >
> > > Since the topology of USB4 and its PCIe bridges are siblings, hardcode
> > > the bridge ID so TBT driver can wake them up to check presence.
> >
> > As I mention below, this is not an acceptable solution.
> >
> > AMD developers, any ideas on how to get this fixed in the TB controller
> > firware instead?
>
> Anson also double checked on the AMD reference hardware that the HP system is built
> against and couldn't reproduce it there either.
>
> KH, I've got a few questions/comments to try to better explain why we're here.
>
> 1) How did you flash the 01.02.01 firmware? In Anson's check, he used dediprog.
> Is it possible there was some stateful stuff used by HP's BIOS still on the SPI from the
> upgrade that didn't get set/cleared properly from an earlier pre-release BIOS?

We used UEFI capsule to update the firmware, via fwupd.

>
> 2) Did you change any BIOS settings? Particularly anything to do with Pre-OS CM?

No, nothing in BIOS was changed.

>
> 3) If you explicitly reset to HP's "default BIOS settings" does it resolve?

Doesn't help. I put the device to ACPI G3 and it doesn't help, either.

>
> 4) Can you double check ADP_CS_5 bit 31? I attached is a patch to kernel Bugzilla to
> add dyndbg output for it. If it was for some reason set by Pre-OS CM in your BIOS/settings
> combination, we might need to undo it by the Linux CM.

All ports say "Hotplug disabled: 0".

dmesg attached to the bugzilla.

>
> 5) Are you changing any of the default runtime PM policies for any of the USB4 routers or
> root ports used for tunneling using software like TLP?

No. And they should be suspended by default.

Kai-Heng

>
> >
> > >
> > > Bugzilla:
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugz
> > illa.kernel.org%2Fshow_bug.cgi%3Fid%3D216448&amp;data=05%7C01%7Cm
> > ario.limonciello%40amd.com%7C1e27b1d6f69e42796c7b08da8f107121%7C3d
> > d8961fe4884e608e11a82d994e183d%7C0%7C0%7C637979598042186185%7CU
> > nknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI
> > 6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=0lhcaKfUyoK
> > 0FXT9uDZ8a%2Fpxs9tHd8aoQcyPFdB%2F0eY%3D&amp;reserved=0
> > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
> > > ---
> > > drivers/thunderbolt/nhi.c | 29 +++++++++++++++++++++++++++++
> > > drivers/thunderbolt/switch.c | 6 ++++++
> > > drivers/thunderbolt/tb.c | 1 +
> > > drivers/thunderbolt/tb.h | 5 +++++
> > > include/linux/thunderbolt.h | 1 +
> > > 5 files changed, 42 insertions(+)
> > >
> > > diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
> > > index cb8c9c4ae93a2..75f5ce5e22978 100644
> > > --- a/drivers/thunderbolt/nhi.c
> > > +++ b/drivers/thunderbolt/nhi.c
> > > @@ -1225,6 +1225,8 @@ static int nhi_probe(struct pci_dev *pdev, const
> > struct pci_device_id *id)
> > > {
> > > struct tb_nhi *nhi;
> > > struct tb *tb;
> > > + struct pci_dev *p = NULL;
> > > + struct tb_pci_bridge *pci_bridge, *n;
> > > int res;
> > >
> > > if (!nhi_imr_valid(pdev)) {
> > > @@ -1306,6 +1308,19 @@ static int nhi_probe(struct pci_dev *pdev, const
> > struct pci_device_id *id)
> > > nhi_shutdown(nhi);
> > > return res;
> > > }
> > > +
> > > + if (pdev->vendor == PCI_VENDOR_ID_AMD) {
> > > + while ((p = pci_get_device(PCI_VENDOR_ID_AMD, 0x14cd,
> > p))) {
> > > + pci_bridge = kmalloc(sizeof(struct tb_pci_bridge),
> > GFP_KERNEL);
> > > + if (!pci_bridge)
> > > + goto cleanup;
> > > +
> > > + pci_bridge->bridge = p;
> > > + INIT_LIST_HEAD(&pci_bridge->list);
> > > + list_add(&pci_bridge->list, &tb->bridge_list);
> > > + }
> > > + }
> >
> > You can't walk the device tree and create a "shadow" list of devices
> > like this and expect any lifetime rules to work properly with them at
> > all.
> >
> > Please do not do this.
> >
> > greg k-h