RE: [PATCH] thunderbolt: Resume PCIe bridges after switch is found on AMD USB4 controller

From: Limonciello, Mario
Date: Wed Sep 07 2022 - 12:30:31 EST


[Public]

Hi,

> -----Original Message-----
> From: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>
> Sent: Monday, September 5, 2022 02:30
> To: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
> Cc: mika.westerberg@xxxxxxxxxxxxxxx; andreas.noever@xxxxxxxxx;
> michael.jamet@xxxxxxxxx; YehezkelShB@xxxxxxxxx; Mehta, Sanju
> <Sanju.Mehta@xxxxxxx>; Limonciello, Mario
> <Mario.Limonciello@xxxxxxx>; linux-usb@xxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] thunderbolt: Resume PCIe bridges after switch is found
> on AMD USB4 controller
>
> On Mon, Sep 05, 2022 at 02:56:22PM +0800, Kai-Heng Feng wrote:
> > AMD USB4 can not detect external PCIe devices like external NVMe when
> > it's hotplugged, because card/link are not up:
> >
> > pcieport 0000:00:04.1: pciehp: pciehp_check_link_active: lnk_status = 1101
>
> That sounds like a hardware bug, how does this work in other operating
> systems for this hardware?

We happen to have this HP system in our lab. My colleague Anson (now on CC) flashed
the same BIOS to it (01.02.01) using dediprog and loaded a 6.0-rc3 mainline kernel built
from the Canonical mainline kernel PPA.

He then tried to hotplug a TBT3 SSD a number of times but couldn't hit this issue.
I attached his log to the kernel Bugzilla.

>
> > Use `lspci` to resume pciehp bridges can find external devices.
>
> That's not good :(
>
> > A long delay before checking card/link presence doesn't help, either.
> > The only way to make the hotplug work is to enable pciehp interrupt and
> > check card presence after the TB switch is added.
> >
> > Since the topology of USB4 and its PCIe bridges are siblings, hardcode
> > the bridge ID so TBT driver can wake them up to check presence.
>
> As I mention below, this is not an acceptable solution.
>
> AMD developers, any ideas on how to get this fixed in the TB controller
> firware instead?

Anson also double checked on the AMD reference hardware that the HP system is built
against and couldn't reproduce it there either.

KH, I've got a few questions/comments to try to better explain why we're here.

1) How did you flash the 01.02.01 firmware? In Anson's check, he used dediprog.
Is it possible there was some stateful stuff used by HP's BIOS still on the SPI from the
upgrade that didn't get set/cleared properly from an earlier pre-release BIOS?

2) Did you change any BIOS settings? Particularly anything to do with Pre-OS CM?

3) If you explicitly reset to HP's "default BIOS settings" does it resolve?

4) Can you double check ADP_CS_5 bit 31? I attached is a patch to kernel Bugzilla to
add dyndbg output for it. If it was for some reason set by Pre-OS CM in your BIOS/settings
combination, we might need to undo it by the Linux CM.

5) Are you changing any of the default runtime PM policies for any of the USB4 routers or
root ports used for tunneling using software like TLP?

>
> >
> > Bugzilla:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugz
> illa.kernel.org%2Fshow_bug.cgi%3Fid%3D216448&amp;data=05%7C01%7Cm
> ario.limonciello%40amd.com%7C1e27b1d6f69e42796c7b08da8f107121%7C3d
> d8961fe4884e608e11a82d994e183d%7C0%7C0%7C637979598042186185%7CU
> nknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI
> 6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=0lhcaKfUyoK
> 0FXT9uDZ8a%2Fpxs9tHd8aoQcyPFdB%2F0eY%3D&amp;reserved=0
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
> > ---
> > drivers/thunderbolt/nhi.c | 29 +++++++++++++++++++++++++++++
> > drivers/thunderbolt/switch.c | 6 ++++++
> > drivers/thunderbolt/tb.c | 1 +
> > drivers/thunderbolt/tb.h | 5 +++++
> > include/linux/thunderbolt.h | 1 +
> > 5 files changed, 42 insertions(+)
> >
> > diff --git a/drivers/thunderbolt/nhi.c b/drivers/thunderbolt/nhi.c
> > index cb8c9c4ae93a2..75f5ce5e22978 100644
> > --- a/drivers/thunderbolt/nhi.c
> > +++ b/drivers/thunderbolt/nhi.c
> > @@ -1225,6 +1225,8 @@ static int nhi_probe(struct pci_dev *pdev, const
> struct pci_device_id *id)
> > {
> > struct tb_nhi *nhi;
> > struct tb *tb;
> > + struct pci_dev *p = NULL;
> > + struct tb_pci_bridge *pci_bridge, *n;
> > int res;
> >
> > if (!nhi_imr_valid(pdev)) {
> > @@ -1306,6 +1308,19 @@ static int nhi_probe(struct pci_dev *pdev, const
> struct pci_device_id *id)
> > nhi_shutdown(nhi);
> > return res;
> > }
> > +
> > + if (pdev->vendor == PCI_VENDOR_ID_AMD) {
> > + while ((p = pci_get_device(PCI_VENDOR_ID_AMD, 0x14cd,
> p))) {
> > + pci_bridge = kmalloc(sizeof(struct tb_pci_bridge),
> GFP_KERNEL);
> > + if (!pci_bridge)
> > + goto cleanup;
> > +
> > + pci_bridge->bridge = p;
> > + INIT_LIST_HEAD(&pci_bridge->list);
> > + list_add(&pci_bridge->list, &tb->bridge_list);
> > + }
> > + }
>
> You can't walk the device tree and create a "shadow" list of devices
> like this and expect any lifetime rules to work properly with them at
> all.
>
> Please do not do this.
>
> greg k-h