Re: [PATCH 2/2] PCI: vmd: Enable ASPM for mobile platforms

From: Ian Kumlien
Date: Fri Oct 09 2020 - 10:34:44 EST


On Thu, Oct 8, 2020 at 6:19 AM Kai-Heng Feng
<kai.heng.feng@xxxxxxxxxxxxx> wrote:
>
>
>
> > On Oct 7, 2020, at 21:30, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >
> > On Wed, Oct 07, 2020 at 12:26:19PM +0800, Kai-Heng Feng wrote:
> >>> On Oct 6, 2020, at 03:19, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >>> On Tue, Oct 06, 2020 at 02:40:32AM +0800, Kai-Heng Feng wrote:
> >>>>> On Oct 3, 2020, at 06:18, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >>>>> On Wed, Sep 30, 2020 at 04:24:54PM +0800, Kai-Heng Feng wrote:
> >
> > ...
> >>>> I wonder whether other devices that add PCIe domain have the same
> >>>> behavior? Maybe it's not a special case at all...
> >>>
> >>> What other devices are these?
> >>
> >> Controllers which add PCIe domain.
> >
> > I was looking for specific examples, not just a restatement of what
> > you said before. I'm just curious because there are a lot of
> > controllers I'm not familiar with, and I can't think of an example.
> >
> >>>> I understand the end goal is to keep consistency for the entire ASPM
> >>>> logic. However I can't think of any possible solution right now.
> >>>>
> >>>>> - If we built with CONFIG_PCIEASPM_POWERSAVE=y, would that solve the
> >>>>> SoC power state problem?
> >>>>
> >>>> Yes.
> >>>>
> >>>>> - What issues would CONFIG_PCIEASPM_POWERSAVE=y introduce?
> >>>>
> >>>> This will break many systems, at least for the 1st Gen Ryzen
> >>>> desktops and laptops.
> >>>>
> >>>> All PCIe ASPM are not enabled by BIOS, and those systems immediately
> >>>> freeze once ASPM is enabled.
> >>>
> >>> That indicates a defect in the Linux ASPM code. We should fix that.
> >>> It should be safe to use CONFIG_PCIEASPM_POWERSAVE=y on every system.
> >>
> >> On those systems ASPM are also not enabled on Windows. So I think
> >> ASPM are disabled for a reason.
> >
> > If the platform knows ASPM needs to be disabled, it should be using
> > ACPI_FADT_NO_ASPM or _OSC to prevent the OS from using it. And if
> > CONFIG_PCIEASPM_POWERSAVE=y means Linux enables ASPM when it
> > shouldn't, that's a Linux bug that we need to fix.
>
> Yes that's a bug which fixed by Ian's new patch.
>
> >
> >>> Are there bug reports for these? The info we would need to start with
> >>> includes "lspci -vv" and dmesg log (with CONFIG_PCIEASPM_DEFAULT=y).
> >>> If a console log with CONFIG_PCIEASPM_POWERSAVE=y is available, that
> >>> might be interesting, too. We'll likely need to add some
> >>> instrumentation and do some experimentation, but in principle, this
> >>> should be fixable.
> >>
> >> Doing this is asking users to use hardware settings that ODM/OEM
> >> never tested, and I think the risk is really high.
> >
> > What? That's not what I said at all. I'm asking for information
> > about these hangs so we can fix them. I'm not suggesting that you
> > should switch to CONFIG_PCIEASPM_POWERSAVE=y for the distro.
>
> Ah, I thought your suggestion is switching to CONFIG_PCIEASPM_POWERSAVE=y, because I sense you want to use that to cover the VMD ASPM this patch tries to solve.
>
> Do we have a conclusion how to enable VMD ASPM with CONFIG_PCIEASPM_DEFAULT=y?
>
> >
> > Let's back up. You said:
> >
> > CONFIG_PCIEASPM_POWERSAVE=y ... will break many systems, at least
> > for the 1st Gen Ryzen desktops and laptops.
> >
> > All PCIe ASPM are not enabled by BIOS, and those systems immediately
> > freeze once ASPM is enabled.
> >
> > These system hangs might be caused by (1) some hardware issue that
> > causes a hang when ASPM is enabled even if it is configured correctly
> > or (2) Linux configuring ASPM incorrectly.
>
> It's (2) here.
>
> >
> > For case (1), the platform should be using ACPI_FADT_NO_ASPM or _OSC
> > to prevent the OS from enabling ASPM. Linux should pay attention to
> > that even when CONFIG_PCIEASPM_POWERSAVE=y.
> >
> > If the platform *should* use these mechanisms but doesn't, the
> > solution is a quirk, not the folklore that "we can't use
> > CONFIG_PCIEASPM_POWERSAVE=y because it breaks some systems."
>
> The platform in question doesn't prevent OS from enabling ASPM.
>
> >
> > For case (2), we should fix Linux so it configures ASPM correctly.
> >
> > We cannot use the build-time CONFIG_PCIEASPM settings to avoid these
> > hangs. We need to fix the Linux run-time code so the system operates
> > correctly no matter what CONFIG_PCIEASPM setting is used.
> >
> > We have sysfs knobs to control ASPM (see 72ea91afbfb0 ("PCI/ASPM: Add
> > sysfs attributes for controlling ASPM link states")). They can do the
> > same thing at run-time as CONFIG_PCIEASPM_POWERSAVE=y does at
> > build-time. If those knobs cause hangs on 1st Gen Ryzen systems, we
> > need to fix that.
>
> Ian's patch solves the issue, at least for the systems I have.

Could you add:
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index 15d64832a988..cd9f2101f9a2 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -482,7 +482,12 @@ static void pcie_aspm_check_latency(struct
pci_dev *endpoint)
latency = max_t(u32, link->latency_up.l1,
link->latency_dw.l1);
l1_max_latency = max_t(u32, latency, l1_max_latency);
if (l1_max_latency + l1_switch_latency > acceptable->l1)
+ {
+ pci_info(endpoint, "L1 latency
exceeded - path: %i - max: %i\n", l1_switch_latency, l1_max_latency);
+ pci_info(link->pdev, "Upstream device
- %i\n", link->latency_up.l1);
+ pci_info(link->downstream, "Downstream
device - %i\n", link->latency_dw.l1);
link->aspm_capable &= ~ASPM_STATE_L1;
+ }

l1_switch_latency += 1000;
}

So we can see what device triggers what links to be disabled?

I think your use-case is much more important than mine - mine fixes
something as a side effect

Also, please send me the lspci -vvv output as well as lspci -PP -s
<device id> for the device id:s mentioned in dmesg with the patch
above applied ;)

> Kai-Heng
>
> >
> > Bjorn
>