Re: [PATCH] PCI/ASPM: Enable ASPM on external PCIe devices

From: Kai-Heng Feng
Date: Tue Jun 27 2023 - 04:36:17 EST


On Fri, Jun 23, 2023 at 7:06 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Tue, Jun 20, 2023 at 01:36:59PM -0500, Limonciello, Mario wrote:
> > <snip>
> > > > A variety of Intel chipsets don't support lane width switching
> > > > or speed switching. When ASPM has been enabled on a dGPU,
> > > > these features are utilized and breakage ensues.
> > > Maybe this helps explain all the completely unmaintainable ASPM
> > > garbage in amdgpu and radeon.
> > >
> > > If these devices are broken, we need quirks for them.
> >
> > The problem is which device do you consider "broken"?
> > The dGPU that uses these features when the platform advertises ASPM
> > or the chipset which doesn't support the features that the device
> > uses when ASPM is active?
> >
> > With this problem I'm talking about the dGPU works fine on hosts
> > that support these features.
>
> Without more details about what's broken and when, I can't say. What
> I *think* is that a device that doesn't work per spec needs a quirk.
> Typically it's a device that advertises a capability that doesn't work
> correctly.

Many silicon vendors use the same "PCI IP" and integrate it into their
own chip. That's one of the reason why the capability doesn't really
reflect what actually being support.
The vendors then send their private datasheet to system integrator so
BIOS can enable what's really supported.

So the logic is to ignore the capability and trust the default set by BIOS.

>
> > > > > > I think the pragmatic way to approach it is to (essentially)
> > > > > > apply the policy as BIOS defaults and allow overrides from
> > > > > > that.
> > > > >
> > > > > Do you mean that when enumerating a device (at boot-time or
> > > > > hot-add time), we would read the current ASPM config but not
> > > > > change it? And users could use the sysfs knobs to
> > > > > enable/disable ASPM as desired?
> > > >
> > > > Yes.
> > > >
> > > Hot-added devices power up with ASPM disabled. This policy would
> > > mean the user has to explicitly enable it, which doesn't seem
> > > practical to me.
> >
> > Could we maybe have the hot added devices follow the policy of
> > the bridge they're connected to by default?
> >
> > > > > That wouldn't solve the problem Kai-Heng is trying to solve.
> > > >
> > > > Alone it wouldn't; but if you treated the i225 PCIe device
> > > > connected to the system as a "quirk" to apply ASPM policy
> > > > from the parent device to this child device it could.
> > >
> > > I want quirks for BROKEN devices. Quirks for working hardware is a
> > > maintenance nightmare.
> >
> > If you follow my idea of hot added devices the policy follows
> > the parent would it work for the i225 PCIe device case?
>
> That doesn't *sound* really robust to me because even if the default
> config after hot-add works, the user can change things via sysfs, and
> any configuration we set it to should work as well. If there are
> land-mines there, we need a quirk that prevents sysfs from running
> into it.

For this case it means driver needs to provide a ASPM callback to flip
PTM based on ASPM sysfs.

Kai-Heng

> Bjorn