Re: [PATCH] PCI/ASPM: Enable ASPM on external PCIe devices

From: Bjorn Helgaas
Date: Tue Jun 27 2023 - 16:54:43 EST


On Tue, Jun 27, 2023 at 04:35:25PM +0800, Kai-Heng Feng wrote:
> On Fri, Jun 23, 2023 at 7:06 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > On Tue, Jun 20, 2023 at 01:36:59PM -0500, Limonciello, Mario wrote:
> > > <snip>
> > > > > A variety of Intel chipsets don't support lane width switching
> > > > > or speed switching. When ASPM has been enabled on a dGPU,
> > > > > these features are utilized and breakage ensues.
> > > >
> > > > Maybe this helps explain all the completely unmaintainable ASPM
> > > > garbage in amdgpu and radeon.
> > > >
> > > > If these devices are broken, we need quirks for them.
> > >
> > > The problem is which device do you consider "broken"?
> > > The dGPU that uses these features when the platform advertises ASPM
> > > or the chipset which doesn't support the features that the device
> > > uses when ASPM is active?
> > >
> > > With this problem I'm talking about the dGPU works fine on hosts
> > > that support these features.
> >
> > Without more details about what's broken and when, I can't say. What
> > I *think* is that a device that doesn't work per spec needs a quirk.
> > Typically it's a device that advertises a capability that doesn't work
> > correctly.
>
> Many silicon vendors use the same "PCI IP" and integrate it into their
> own chip. That's one of the reason why the capability doesn't really
> reflect what actually being support.
> The vendors then send their private datasheet to system integrator so
> BIOS can enable what's really supported.

It's perfectly fine for the IP to support PCI features that are not
and can not be enabled in a system design. But I expect that
strapping or firmware would disable those features so they are not
advertised in config space.

If BIOS leaves features disabled because they cannot work, but at the
same time leaves them advertised in config space, I'd say that's a
BIOS defect. In that case, we should have a DMI quirk or something to
work around the defect.

> So the logic is to ignore the capability and trust the default set
> by BIOS.

I think limiting ASPM support to whatever BIOS configured at boot-time
is problematic. I don't think we can assume that all platforms have
firmware that configures ASPM as aggressively as possible, and
obviously firmware won't configure hot-added devices at all (in
general; I know ACPI _HPX can do some of that).

It's possible we need some kind of policy that limits ASPM to the BIOS
config until date X, but I don't want to limit that forever.

> > > If you follow my idea of hot added devices the policy follows
> > > the parent would it work for the i225 PCIe device case?
> >
> > That doesn't *sound* really robust to me because even if the default
> > config after hot-add works, the user can change things via sysfs, and
> > any configuration we set it to should work as well. If there are
> > land-mines there, we need a quirk that prevents sysfs from running
> > into it.
>
> For this case it means driver needs to provide a ASPM callback to flip
> PTM based on ASPM sysfs.

I'm not sure I follow this, but it sounds like you're saying PTM
doesn't work correctly with some ASPM configurations. Is this allowed
by the PCIe spec or is it a hardware defect? If the latter, maybe we
just need a quirk to disable PTM on the device.

I'm not sure PTM is valuable enough to add a generic callback
mechanism to work around a defect in an individual device.

Bjorn