Re: [PATCH] PCI/ASPM: Enable ASPM on external PCIe devices

From: Bjorn Helgaas
Date: Mon Jun 19 2023 - 17:37:24 EST


On Mon, Jun 19, 2023 at 11:16:35AM -0500, Limonciello, Mario wrote:
> On 6/15/2023 10:01 PM, Kai-Heng Feng wrote:
> > On Fri, Jun 16, 2023 at 1:12 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Thu, Jun 15, 2023 at 03:04:20PM +0800, Kai-Heng Feng wrote:
> > > > When a PCIe device is hotplugged to a Thunderbolt port, ASPM is not
> > > > enabled for that device. However, when the device is plugged preboot,
> > > > ASPM is enabled by default.
> > > >
> > > > The disparity happens because BIOS doesn't have the ability to program
> > > > ASPM on hotplugged devices.
> > > >
> > > > So enable ASPM by default for external connected PCIe devices so ASPM
> > > > settings are consitent between preboot and hotplugged.
> > > >
> > > > On HP Thunderbolt Dock G4, enable ASPM can also fix BadDLLP error:
> > > > pcieport 0000:00:1d.0: AER: Corrected error received: 0000:07:04.0
> > > > pcieport 0000:07:04.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
> > > > pcieport 0000:07:04.0: device [8086:0b26] error status/mask=00000080/00002000
> > > > pcieport 0000:07:04.0: [ 7] BadDLLP
> > > >
> > > > The root cause is still unclear, but quite likely because the I225 on
> > > > the dock supports PTM, where ASPM timing is precalculated for the PTM.
> > > >
> > > > Cc: Mario Limonciello <mario.limonciello@xxxxxxx>
> > > > Cc: Mika Westerberg <mika.westerberg@xxxxxxxxxxxxxxx>
> > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=217557
> > > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
> > > > ---
> > > > drivers/pci/pcie/aspm.c | 4 +++-
> > > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > index 66d7514ca111..613b0754c9bb 100644
> > > > --- a/drivers/pci/pcie/aspm.c
> > > > +++ b/drivers/pci/pcie/aspm.c
> > > > @@ -119,7 +119,9 @@ static int policy_to_aspm_state(struct pcie_link_state *link)
> > > > /* Enable Everything */
> > > > return ASPM_STATE_ALL;
> > > > case POLICY_DEFAULT:
> > > > - return link->aspm_default;
> > > > + return dev_is_removable(&link->downstream->dev) ?
> > > > + link->aspm_capable :
> > > > + link->aspm_default;
> > >
> > > I'm a little hesitant because dev_is_removable() is a convenient
> > > test that covers the current issue, but it doesn't seem tightly
> > > connected from a PCIe architecture perspective.
> > >
> > > I think the current model of compile-time ASPM policy selection:
> > >
> > > CONFIG_PCIEASPM_DEFAULT /* BIOS default setting */
> > > CONFIG_PCIEASPM_PERFORMANCE /* disable L0s and L1 */
> > > CONFIG_PCIEASPM_POWERSAVE /* enable L0s and L1 */
> > > CONFIG_PCIEASPM_POWER_SUPERSAVE /* enable L1 substates */
> > >
> > > is flawed. As far as I know, there's no technical reason we
> > > have to select this at kernel build-time. I suspect the
> > > original reason was risk avoidance, i.e., we were worried that
> > > we might expose hardware defects if we enabled ASPM states that
> > > BIOS hadn't already enabled.
> > >
> > > How do we get out of that model? We do have sysfs knobs that
> > > should cover all the functionality (set overall policy as above
> > > via /sys/module/pcie_aspm/parameters/policy; set device-level
> > > exceptions via /sys/bus/pci/devices/.../link/*_aspm).
> >
> > Agree. Build-time policy can be obsoleted by boot-time argument.
>
> I agree as well.

This isn't strictly relevant to the current problem, so let's put this
on the back burner for now.

> > > In my opinion, the cleanest solution would be to enable all ASPM
> > > functionality whenever possible and let users disable it if they
> > > need to for performance. If there are device defects when
> > > something is enabled, deal with it via quirks, as we do for
> > > other PCI features.
> > >
> > > That feels a little risky, but let's have a conversation about
> > > where we want to go in the long term. It's good to avoid risk,
> > > but too much avoidance leads to its own complexity and an
> > > inability to change things.
> >
> > I think we should separate the situation into two cases:
> > - When BIOS/system firmware has the ability to program ASPM, honor
> > it. This applies to most "internal" PCI devices.
> > - When BIOS/system can't program ASPM, enable ASPM for whatever
> > it's capable of. Most notable case is Intel VMD controller, and
> > this patch for devices connected through TBT.
> >
> > Enabling all ASPM functionality regardless of what's being
> > pre-programmed by BIOS is way too risky. Disabling ASPM to
> > workaround issues and defects are still quite common among
> > hardware manufacturers.

It sounds like you have actual experience with this :) Do you have
any concrete examples that we can use as "known breakage"?

This feels like a real problem to me. There are existing mechanisms
(ACPI_FADT_NO_ASPM and _OSC PCIe cap ownership) the platform can use
to prevent the OS from using ASPM.

If vendors assume that *in addition*, the OS will pay attention to
whatever ASPM configuration BIOS did, that's a major disconnect. We
don't do anything like that for other PCI features, and I'm not aware
of any restriction like that being documented.

> I think the pragmatic way to approach it is to (essentially) apply
> the policy as BIOS defaults and allow overrides from that.

Do you mean that when enumerating a device (at boot-time or hot-add
time), we would read the current ASPM config but not change it? And
users could use the sysfs knobs to enable/disable ASPM as desired?
That wouldn't solve the problem Kai-Heng is trying to solve.

Or that we leave ASPM alone during boot-time enumeration, but enable
ASPM when we enumerate hot-added devices? It doesn't sound right that
a device would be configured differently if present at boot vs
hot-added.

Bjorn