Re: [PATCH] PCI/ASPM: Enable ASPM on external PCIe devices

From: Limonciello, Mario
Date: Tue Jun 20 2023 - 14:37:09 EST


<snip>
A variety of Intel chipsets don't support lane width switching
or speed switching.  When ASPM has been enabled on a dGPU,
these features are utilized and breakage ensues.

Maybe this helps explain all the completely unmaintainable ASPM
garbage in amdgpu and radeon.

If these devices are broken, we need quirks for them.

The problem is which device do you consider "broken"?
The dGPU that uses these features when the platform advertises ASPM
or the chipset which doesn't support the features that the device
uses when ASPM is active?

In the problem I'm describing, the dGPU works fine on hosts
that support these features.

KH has a lot more experience with ASPM issues and hopefully has some
other examples to share.

We can't avoid ASPM in general just because random devices break.

I'm not advocating to avoid it in general, I'm saying we shouldn't
be turning it on across the board for all devices if the platform had
it off initially via a kernel command line option or Kconfig.

There are various methods to try to mitigate the impact both in
firmware and driver code.

This feels like a real problem to me. There are existing mechanisms
(ACPI_FADT_NO_ASPM and _OSC PCIe cap ownership) the platform can use
to prevent the OS from using ASPM.
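
For illustration only, a minimal Python sketch of how the first of those
mechanisms could be inspected from userspace. It assumes the raw FADT is
readable at /sys/firmware/acpi/tables/FACP (normally root only) and that
the 16-bit IAPC_BOOT_ARCH field sits at byte offset 109, with bit 4 being
the flag the kernel calls ACPI_FADT_NO_ASPM. The helper names are
hypothetical and this is not how the kernel itself performs the check.

```python
import struct

IAPC_BOOT_ARCH_OFFSET = 109   # assumed byte offset of the 16-bit IAPC_BOOT_ARCH field
ACPI_FADT_NO_ASPM = 1 << 4    # "PCIe ASPM Controls" bit in IAPC_BOOT_ARCH


def fadt_forbids_aspm(fadt):
    """Return True if the raw FADT bytes have the no-ASPM boot flag set."""
    if len(fadt) < IAPC_BOOT_ARCH_OFFSET + 2:
        # Table too short to contain the field; treat as "flag not set".
        return False
    (boot_flags,) = struct.unpack_from("<H", fadt, IAPC_BOOT_ARCH_OFFSET)
    return bool(boot_flags & ACPI_FADT_NO_ASPM)


def read_fadt(path="/sys/firmware/acpi/tables/FACP"):
    """Read the raw FADT exported by the kernel (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read()
```

On a real system this would be called as fadt_forbids_aspm(read_fadt()).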

If vendors assume that *in addition*, the OS will pay attention to
whatever ASPM configuration BIOS did, that's a major disconnect. We
don't do anything like that for other PCI features, and I'm not aware
of any restriction like that being documented.

With both of those policies in place, how did we get into
the situation of having configuration options and knobs?

The kernel parameters and config options have been there pretty much from
the beginning. We didn't have the per-device sysfs knobs until very
recently.

Ah, I see.

I think the pragmatic way to approach it is to (essentially) apply
the policy as BIOS defaults and allow overrides from that.

Do you mean that when enumerating a device (at boot-time or hot-add
time), we would read the current ASPM config but not change it? And
users could use the sysfs knobs to enable/disable ASPM as desired?

Yes.

Hot-added devices power up with ASPM disabled. This policy would mean
the user has to explicitly enable it, which doesn't seem practical to
me.
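
To make the "read but don't change, let the user override" idea concrete,
here is a rough Python sketch built on the per-device sysfs knobs. It
assumes the attributes under /sys/bus/pci/devices/<BDF>/link/ described in
Documentation/ABI/testing/sysfs-bus-pci, which only exist when the link
supports the corresponding state. The function names and the sysfs_root
parameter are made up for the example.

```python
from pathlib import Path

# Per-link ASPM attributes; a device only exposes the ones its link supports.
ASPM_KNOBS = ("l0s_aspm", "l1_aspm", "l1_1_aspm", "l1_2_aspm")


def read_aspm_knobs(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return {knob: enabled} for the knobs present, without writing anything."""
    link_dir = Path(sysfs_root) / bdf / "link"
    state = {}
    for knob in ASPM_KNOBS:
        path = link_dir / knob
        if path.is_file():
            state[knob] = path.read_text().strip() == "1"
    return state


def set_aspm_knob(bdf, knob, enable, sysfs_root="/sys/bus/pci/devices"):
    """Explicit user override of one knob (needs root on a real system)."""
    if knob not in ASPM_KNOBS:
        raise ValueError(f"unknown ASPM knob: {knob}")
    path = Path(sysfs_root) / bdf / "link" / knob
    path.write_text("1" if enable else "0")
```

For a hot-added device this is the step the user would have to do by hand,
e.g. set_aspm_knob("0000:01:00.0", "l1_aspm", True) with a placeholder BDF.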

Could we maybe have the hot-added devices follow the policy of
the bridge they're connected to by default?
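
A toy Python model of that suggestion, not kernel code: the hot-added
device would take whatever ASPM states are enabled on the bridge above
it, limited to what the device itself advertises. The bit values mirror
the ASPM Control encoding in the Link Control register (bit 0 for L0s,
bit 1 for L1); the function name is hypothetical, and directional L0s
and L1 substates are ignored for simplicity.

```python
ASPM_L0S = 0x1  # L0s entry enabled
ASPM_L1 = 0x2   # L1 entry enabled


def inherit_from_bridge(bridge_enabled, child_supported):
    """States a hot-added child would enable if it followed its parent bridge.

    bridge_enabled:  ASPM states currently enabled on the upstream bridge.
    child_supported: ASPM states the child advertises as supported.
    """
    return bridge_enabled & child_supported & (ASPM_L0S | ASPM_L1)
```

If the bridge has ASPM disabled, the child stays disabled too, which is
the same result as leaving a hot-added device at its power-up default.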

That wouldn't solve the problem Kai-Heng is trying to solve.

Alone it wouldn't; but if you treated the i225 PCIe device
connected to the system as a "quirk" to apply ASPM policy
from the parent device to this child device it could.

I want quirks for BROKEN devices. Quirks for working hardware are a
maintenance nightmare.

Bjorn

If hot-added devices followed the parent's policy, as I suggested,
would that work for the i225 PCIe device case?