Re: [PATCH v3 1/2] PCI: dwc: Implement general suspend/resume functionality for L2/L3 transitions

From: Manivannan Sadhasivam
Date: Fri Jul 21 2023 - 12:08:18 EST


On Fri, Jul 21, 2023 at 10:10:11AM -0400, Frank Li wrote:
> On Fri, Jul 21, 2023 at 10:09:18AM +0800, Shawn Lin wrote:
> >
> > On 2023/7/21 0:07, Manivannan Sadhasivam wrote:
> > > On Thu, Jul 20, 2023 at 10:37:36AM -0400, Frank Li wrote:
> > > > On Thu, Jul 20, 2023 at 07:55:09PM +0530, Manivannan Sadhasivam wrote:
> > > > > On Tue, Jul 18, 2023 at 03:34:26PM +0530, Manivannan Sadhasivam wrote:
> > > > > > On Mon, Jul 17, 2023 at 02:36:19PM -0400, Frank Li wrote:
> > > > > > > On Mon, Jul 17, 2023 at 10:15:26PM +0530, Manivannan Sadhasivam wrote:
> > > > > > > > On Wed, Apr 19, 2023 at 12:41:17PM -0400, Frank Li wrote:
> > > > > > > > > Introduced helper function dw_pcie_get_ltssm to retrieve SMLH_LTSS_STATE.
> > > > > > > > > Added API pme_turn_off and exit_from_l2 for managing L2/L3 state transitions.
> > > > > > > > >
> > > > > > > > > Typical L2 entry workflow:
> > > > > > > > >
> > > > > > > > > 1. Transmit PME turn off signal to PCI devices.
> > > > > > > > > 2. Await link entering L2_IDLE state.
> > > > > > > >
> > > > > > > > AFAIK, typical workflow is to wait for PME_To_Ack.
> > > > > > >
> > > > > > > 1 Already wait for PME_to_ACK, 2, just wait for link actual enter L2.
> > > > > > > I think PCI RC needs some time to set link enter L2 after get ACK from
> > > > > > > PME.
> > > > > > >
> > > > >
> > > > > One more comment. If you transition the device to L2/L3, then it can loose power
> > > > > if Vaux was not provided. In that case, can all the devices work after resume?
> > > > > Most notably NVMe?
> > > >
> > > > I have not hardware to do such test, NVMe driver will reinit everything after
> > > > resume if no L1.1\L1.2 support. If there are L1.1\L1.2, NVME expect it leave
> > > > at L1.2 at suspend to get better resume latency.
> > > >
> > >
> > > To be precise, NVMe driver will shutdown the device if there is no ASPM support
> > > and keep it in low power mode otherwise (there are other cases as well but we do
> > > not need to worry).
> > >
> > > But here you are not checking for ASPM state in the suspend path, and just
> > > forcing the link to be in L2/L3 (thereby D3Cold) even though NVMe driver may
> > > expect it to be in low power state like ASPM/APST.
> > >
> > > So you should only put the link to L2/L3 if there is no ASPM support. Otherwise,
> > > you'll ending up with bug reports when users connect NVMe to it.
> > >
> >
> >
> > At this topic, it's very interesting to look at
> >
> > drivers/pci/controller/dwc/pcie-tegra194.c
> >
> >
> > static int tegra_pcie_dw_suspend_noirq(struct device *dev)
> > {
> > struct tegra_pcie_dw *pcie = dev_get_drvdata(dev);
> >
> > if (!pcie->link_state)
> > return 0;
> >
> > tegra_pcie_downstream_dev_to_D0(pcie);
> > tegra_pcie_dw_pme_turnoff(pcie);
> > tegra_pcie_unconfig_controller(pcie);
> >
> > return 0;
> > }
> >
> > It brings back all the downstream components to D0, as I assumed it was L0
> > indeed, before sending PME aiming to enter L2.
>
> If current state is L1.1 or L1.2, hardware can auto enter to D0\L0 when
> there are any PCI bus activity, include PME. I supposed
> tegra_pcie_downstream_dev_to_D0() just make sure come back from L2/L3,
> which may enter by runtime PM previously, or other reason.
>
> NVME ASPM problem is (at least when I debug at other platform about 1 year
> ago):
>
> 1. NVME will not release MSI interrupt during suspsend.
> 2. PCI controler enter L2 at suspned_noirq();
> 3. CPU hot plug try to down second core (CORE1, CORE2, ...)
> 4. GIC try to disable MSI irq by write config space.

Just for the record, this will only happen during deep sleep (s2ram) where the
CPUs are powered down (including the boot CPU).

> 5. panic here because config space can't be access at L2.
>
> I suposed tegra should have problem when ASPM enable with NVME devices.
>

NVMe suspend issue has several faces:

If NVMe is powered down during suspend, it will result in considerable power
savings. But at the same time, the suspend should not happen too frequently as
it may deteriorate the lifetime of the device. Most of the recent NVMe devices
have 2M power cycles (only).

We can workaround the above lifetime issue by powering down the device only
during s2ram. It will work great for Laptop use cases if s2ram is supported.
Unfortunately, not all Laptops support s2ram though. And if the device is
powered down during s2idle, it will hit the above life time issue when it is
used in Android platforms such as Tablets (even future mobile phones?) which
doesn't support s2ram.

So I'm thinking of the following options to address the issue(s):

1. Modify the NVMe driver to power down the device during s2ram (the driver can
use the global pm_suspend_target_state flag to detect the suspend state) and use
the same logic to put the link into L2/L3 state in the PCIe controller drivers.
For s2idle, maintain both the device and link in low power states.

2. Get the power management decision from the userspace and use that to decide
the power down logic in s2idle for both NVMe and PCIe drivers. This will ensure
that the NVMe device will be powered down in suitable usecases like Laptop
without s2ram and kept in low power states for Android platforms. But this is
quite an involved task and I don't know if it is possible at all.

I'm just dumping my thoughts here. And I plan to intiate a discussion with NVMe/
power folks soon. Maybe during Plumbers?

- Mani

> Frank
> >
> > > - Mani
> > >
> > > > This API help remove duplicate codes and it can be improved gradually.
> > > >
> > > >
> > > > >
> > > > > - Mani
> > > > >
> > > > >
> > > > > --
> > > > > மணிவண்ணன் சதாசிவம்
> > >

--
மணிவண்ணன் சதாசிவம்