Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle

From: Keith Busch
Date: Fri May 10 2019 - 11:43:48 EST


On Fri, May 10, 2019 at 11:15:05PM +0800, Kai Heng Feng wrote:
> Sorry, I should mention that I use a slightly modified drivers/nvme/host/pci.c:
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 3e4fb891a95a..ece428ce6876 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -18,6 +18,7 @@
> #include <linux/mutex.h>
> #include <linux/once.h>
> #include <linux/pci.h>
> +#include <linux/suspend.h>
> #include <linux/t10-pi.h>
> #include <linux/types.h>
> #include <linux/io-64-nonatomic-lo-hi.h>
> @@ -2833,6 +2834,11 @@ static int nvme_suspend(struct device *dev)
> 	struct pci_dev *pdev = to_pci_dev(dev);
> 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
>
> +	if (!pm_suspend_via_firmware()) {
> +		nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
> +		pci_save_state(pdev);
> +	}
> +
> 	nvme_dev_disable(ndev, true);

This won't work because you're falling through to nvme_dev_disable after
setting the power, so the resume side would certainly fail.

This is what you'd want:

	if (!pm_suspend_via_firmware()) {
		pci_save_state(pdev);
		return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
	}
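
To illustrate the difference between the two suspend paths, here is a toy
userspace model. It is only a sketch, not driver code: struct toy_ctrl and
the toy_* functions are hypothetical stand-ins, and it assumes that the
power state change on resume needs a controller that was not disabled.

```c
#include <stdbool.h>

/* Toy stand-in for the controller; not the kernel's struct nvme_dev. */
struct toy_ctrl {
	bool enabled;		/* false once the "disable" path has run */
	int power_state;	/* 0 is the operational state in this model */
};

/* Shape of the quoted patch: set the power state, then fall through. */
void toy_suspend_fallthrough(struct toy_ctrl *c, int npss)
{
	c->power_state = npss;
	c->enabled = false;	/* stands in for nvme_dev_disable() */
}

/* Shape of the suggested fix: set the power state and return early. */
void toy_suspend_early_return(struct toy_ctrl *c, int npss)
{
	c->power_state = npss;
}

/*
 * Shape of the quoted resume path: restore power state 0 without a reset.
 * Returns 0 on success, -1 if the controller cannot accept the request
 * (an assumption of this model).
 */
int toy_resume(struct toy_ctrl *c)
{
	if (!c->enabled)
		return -1;
	c->power_state = 0;
	return 0;
}
```

In this model, toy_resume() fails after toy_suspend_fallthrough() and
succeeds after toy_suspend_early_return().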

> 	return 0;
> }
> @@ -2842,6 +2848,10 @@ static int nvme_resume(struct device *dev)
> 	struct pci_dev *pdev = to_pci_dev(dev);
> 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
>
> +	if (!pm_resume_via_firmware()) {
> +		return nvme_set_power(&ndev->ctrl, 0);
> +	}
> +
> 	nvme_reset_ctrl(&ndev->ctrl);
> 	return 0;
> }
>
> Is pci_save_state() here enough to prevent the device from entering D3?

Yes, saving the state during suspend will prevent pci pm from assuming
control over this device's power. It's a bit non-intuitive as Christoph
mentioned, so we'll need to make that clear in the comments. For
reference, here's the relevant part in pci-driver.c:

---
static int pci_pm_suspend_noirq(struct device *dev)
{
	...

	if (!pci_dev->state_saved) {
		pci_save_state(pci_dev);
		if (pci_power_manageable(pci_dev))
			pci_prepare_to_sleep(pci_dev);
	}
	...
}
---

So by saving the state in our suspend callback, pci will skip
pci_prepare_to_sleep(), which is most likely what's putting your device
into D3hot and undermining our nvme power state setting.
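
The state_saved interaction can be modelled in plain userspace C. This is
only a sketch under stated assumptions: struct toy_pci_dev and the toy_*
functions are hypothetical, the pci_power_manageable() check is left out,
and the "sleep" step is reduced to setting a D3hot flag.

```c
#include <stdbool.h>

enum toy_pstate { TOY_D0, TOY_D3HOT };

/* Toy stand-in for the two fields of struct pci_dev that matter here. */
struct toy_pci_dev {
	bool state_saved;
	enum toy_pstate pstate;
};

/* Driver suspend callback; save_state selects whether it saves state. */
void toy_driver_suspend(struct toy_pci_dev *d, bool save_state)
{
	if (save_state)
		d->state_saved = true;	/* stands in for pci_save_state() */
}

/* Simplified shape of the pci_pm_suspend_noirq() branch quoted above. */
void toy_pm_suspend_noirq(struct toy_pci_dev *d)
{
	if (!d->state_saved) {
		d->state_saved = true;	/* the core saves state itself */
		d->pstate = TOY_D3HOT;	/* stands in for pci_prepare_to_sleep() */
	}
}
```

Calling toy_driver_suspend(d, true) before toy_pm_suspend_noirq(d) leaves
the device in TOY_D0 in this model; with false it ends up in TOY_D3HOT.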