Re: [PATCH] PCI/ASPM: Enable ASPM on external PCIe devices

From: Kai-Heng Feng
Date: Wed Jun 28 2023 - 04:58:58 EST


[+Cc Sasha and Intel Wired Lan]

On Wed, Jun 28, 2023 at 4:54 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Tue, Jun 27, 2023 at 04:35:25PM +0800, Kai-Heng Feng wrote:
> > On Fri, Jun 23, 2023 at 7:06 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Tue, Jun 20, 2023 at 01:36:59PM -0500, Limonciello, Mario wrote:
> > > > <snip>
> > > > > > A variety of Intel chipsets don't support lane width switching
> > > > > > or speed switching. When ASPM has been enabled on a dGPU,
> > > > > > these features are utilized and breakage ensues.
> > > > >
> > > > > Maybe this helps explain all the completely unmaintainable ASPM
> > > > > garbage in amdgpu and radeon.
> > > > >
> > > > > If these devices are broken, we need quirks for them.
> > > >
> > > > The problem is which device do you consider "broken"?
> > > > The dGPU that uses these features when the platform advertises ASPM
> > > > or the chipset which doesn't support the features that the device
> > > > uses when ASPM is active?
> > > >
> > > > With this problem I'm talking about the dGPU works fine on hosts
> > > > that support these features.
> > >
> > > Without more details about what's broken and when, I can't say. What
> > > I *think* is that a device that doesn't work per spec needs a quirk.
> > > Typically it's a device that advertises a capability that doesn't work
> > > correctly.
> >
> > Many silicon vendors use the same "PCI IP" and integrate it into their
> > own chip. That's one of the reason why the capability doesn't really
> > reflect what actually being support.
> > The vendors then send their private datasheet to system integrator so
> > BIOS can enable what's really supported.
>
> It's perfectly fine for the IP to support PCI features that are not
> and can not be enabled in a system design. But I expect that
> strapping or firmware would disable those features so they are not
> advertised in config space.
>
> If BIOS leaves features disabled because they cannot work, but at the
> same time leaves them advertised in config space, I'd say that's a
> BIOS defect. In that case, we should have a DMI quirk or something to
> work around the defect.

That means most if not all BIOS are defected.
BIOS vendors and ODM never bothered (and probably will not) to change
the capabilities advertised by config space because "it already works
under Windows".

>
> > So the logic is to ignore the capability and trust the default set
> > by BIOS.
>
> I think limiting ASPM support to whatever BIOS configured at boot-time
> is problematic. I don't think we can assume that all platforms have
> firmware that configures ASPM as aggressively as possible, and
> obviously firmware won't configure hot-added devices at all (in
> general; I know ACPI _HPX can do some of that).

Totally agree. I was not suggesting to limiting the setting at all.
A boot-time parameter to flip ASPM setting is very useful. If none has
been set, default to BIOS setting.

>
> It's possible we need some kind of policy that limits ASPM to the BIOS
> config until date X, but I don't want to limit that forever.

Desktop can have very different PCIe cards in the slots so BIOS date
isn't a good indicator for ASPM support.
Let alone BIOS date changes on each BIOS upgrade, this means potential
regression risk.

>
> > > > If you follow my idea of hot added devices the policy follows
> > > > the parent would it work for the i225 PCIe device case?
> > >
> > > That doesn't *sound* really robust to me because even if the default
> > > config after hot-add works, the user can change things via sysfs, and
> > > any configuration we set it to should work as well. If there are
> > > land-mines there, we need a quirk that prevents sysfs from running
> > > into it.
> >
> > For this case it means driver needs to provide a ASPM callback to flip
> > PTM based on ASPM sysfs.
>
> I'm not sure I follow this, but it sounds like you're saying PTM
> doesn't work correctly with some ASPM configurations. Is this allowed
> by the PCIe spec or is it a hardware defect? If the latter, maybe we
> just need a quirk to disable PTM on the device.

I'll leave this one to Intel folks.

>
> I'm not sure PTM is valuable enough to add a generic callback
> mechanism to work around a defect in an individual device.

Agree. Right now the issue is only observed when I225 is in a TBT dock.

Kai-Heng

>
> Bjorn