Re: [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework

From: Alex Williamson
Date: Wed Feb 16 2022 - 18:48:17 EST


On Mon, 24 Jan 2022 23:47:22 +0530
Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:

> Currently, there is very limited power management support
> available in the upstream vfio-pci driver. If there is no user of vfio-pci
> device, then the PCI device will be moved into D3Hot state by writing
> directly into PCI PM registers. This D3Hot state help in saving power
> but we can achieve zero power consumption if we go into the D3cold state.
> The D3cold state cannot be possible with native PCI PM. It requires
> interaction with platform firmware which is system-specific.
> To go into low power states (including D3cold), the runtime PM framework
> can be used which internally interacts with PCI and platform firmware and
> puts the device into the lowest possible D-States.
>
> This patch registers vfio-pci driver with the runtime PM framework.
>
> 1. The PCI core framework takes care of most of the runtime PM
> related things. For enabling the runtime PM, the PCI driver needs to
> decrement the usage count and needs to register the runtime
> suspend/resume callbacks. For vfio-pci based driver, these callback
> routines can be stubbed in this patch since the vfio-pci driver
> is not doing the PCI device initialization. All the config state
> saving, and PCI power management related things will be done by
> PCI core framework itself inside its runtime suspend/resume callbacks.
>
> 2. Inside pci_reset_bus(), all the devices in bus/slot will be moved
> out of D0 state. This state change to D0 can happen directly without
> going through the runtime PM framework. So if runtime PM is enabled,
> then pm_runtime_resume() makes the runtime state active. Since the PCI
> device power state is already D0, so it should return early when it
> tries to change the state with pci_set_power_state(). Then
> pm_request_idle() can be used which will internally check for
> device usage count and will move the device again into the low power
> state.
>
> 3. Inside vfio_pci_core_disable(), the device usage count always needs
> to be decremented which was incremented in vfio_pci_core_enable().
>
> 4. Since the runtime PM framework will provide the same functionality,
> so directly writing into PCI PM config register can be replaced with
> the use of runtime PM routines. Also, the use of runtime PM can help
> us in more power saving.
>
> In the systems which do not support D3Cold,
>
> With the existing implementation:
>
> // PCI device
> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
> D3hot
> // upstream bridge
> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
> D0
>
> With runtime PM:
>
> // PCI device
> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
> D3hot
> // upstream bridge
> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
> D3hot
>
> So, with runtime PM, the upstream bridge or root port will also go
> into lower power state which is not possible with existing
> implementation.
>
> In the systems which support D3Cold,
>
> // PCI device
> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
> D3hot
> // upstream bridge
> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
> D0
>
> With runtime PM:
>
> // PCI device
> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
> D3cold
> // upstream bridge
> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
> D3cold
>
> So, with runtime PM, both the PCI device and upstream bridge will
> go into D3cold state.
>
> 5. If 'disable_idle_d3' module parameter is set, then also the runtime
> PM will be enabled, but in this case, the usage count should not be
> decremented.
>
> 6. vfio_pci_dev_set_try_reset() return value is unused now, so this
> function return type can be changed to void.
>
> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx>
> ---
> drivers/vfio/pci/vfio_pci.c | 3 +
> drivers/vfio/pci/vfio_pci_core.c | 95 +++++++++++++++++++++++---------
> include/linux/vfio_pci_core.h | 4 ++
> 3 files changed, 75 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index a5ce92beb655..c8695baf3b54 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver = {
> .remove = vfio_pci_remove,
> .sriov_configure = vfio_pci_sriov_configure,
> .err_handler = &vfio_pci_core_err_handlers,
> +#if defined(CONFIG_PM)
> + .driver.pm = &vfio_pci_core_pm_ops,
> +#endif
> };
>
> static void __init vfio_pci_fill_ids(void)
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index f948e6cd2993..c6e4fe9088c3 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
> }
>
> struct vfio_pci_group_info;
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> struct vfio_pci_group_info *groups);
>
> @@ -245,7 +245,11 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
> u16 cmd;
> u8 msix_pos;
>
> - vfio_pci_set_power_state(vdev, PCI_D0);
> + if (!disable_idle_d3) {
> + ret = pm_runtime_resume_and_get(&pdev->dev);
> + if (ret < 0)
> + return ret;
> + }

Sorry for the delay in review, I'm a novice in pm runtime, but I
haven't forgotten about the remainder of this series.

I think we're removing the unconditional wake here because we now wake
the device in the core registration function below, but I think there
might be a subtle dependency here on the fix to always wake devices in
the disable function as well, otherwise I'm afraid the power state of a
device released in D3hot could leak to the next user here.

>
> /* Don't allow our initial saved state to include busmaster */
> pci_clear_master(pdev);
> @@ -405,8 +409,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
> out:
> pci_disable_device(pdev);
>
> - if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
> - vfio_pci_set_power_state(vdev, PCI_D3hot);
> + vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
> +
> + /* Put the pm-runtime usage counter acquired during enable */
> + if (!disable_idle_d3)
> + pm_runtime_put(&pdev->dev);
> }
> EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>
> @@ -1847,19 +1854,20 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>
> vfio_pci_probe_power_state(vdev);
>
> - if (!disable_idle_d3) {
> - /*
> - * pci-core sets the device power state to an unknown value at
> - * bootup and after being removed from a driver. The only
> - * transition it allows from this unknown state is to D0, which
> - * typically happens when a driver calls pci_enable_device().
> - * We're not ready to enable the device yet, but we do want to
> - * be able to get to D3. Therefore first do a D0 transition
> - * before going to D3.
> - */
> - vfio_pci_set_power_state(vdev, PCI_D0);
> - vfio_pci_set_power_state(vdev, PCI_D3hot);
> - }
> + /*
> + * pci-core sets the device power state to an unknown value at
> + * bootup and after being removed from a driver. The only
> + * transition it allows from this unknown state is to D0, which
> + * typically happens when a driver calls pci_enable_device().
> + * We're not ready to enable the device yet, but we do want to
> + * be able to get to D3. Therefore first do a D0 transition
> + * before enabling runtime PM.
> + */
> + vfio_pci_set_power_state(vdev, PCI_D0);
> + pm_runtime_allow(&pdev->dev);
> +
> + if (!disable_idle_d3)
> + pm_runtime_put(&pdev->dev);

I could use some enlightenment here. pm_runtime_allow() only does
something if power.runtime_allow is false, in which case it sets that
value to true and decrements power.usage_count. runtime_allow is
enabled by default in pm_runtime_init(), but pci_pm_init() calls
pm_runtime_forbid() which does the reverse of pm_runtime_allow(). So
do I understand correctly that PCI devices are probed with
runtime_allow = false and a usage_count of 2?

>
> ret = vfio_register_group_dev(&vdev->vdev);
> if (ret)
> @@ -1868,7 +1876,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>
> out_power:
> if (!disable_idle_d3)
> - vfio_pci_set_power_state(vdev, PCI_D0);
> + pm_runtime_get_noresume(&pdev->dev);
> +
> + pm_runtime_forbid(&pdev->dev);
> out_vf:
> vfio_pci_vf_uninit(vdev);
> return ret;
> @@ -1887,7 +1897,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
> vfio_pci_vga_uninit(vdev);
>
> if (!disable_idle_d3)
> - vfio_pci_set_power_state(vdev, PCI_D0);
> + pm_runtime_get_noresume(&pdev->dev);
> +
> + pm_runtime_forbid(&pdev->dev);
> }
> EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>
> @@ -2093,33 +2105,62 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
> * - At least one of the affected devices is marked dirty via
> * needs_reset (such as by lack of FLR support)
> * Then attempt to perform that bus or slot reset.
> - * Returns true if the dev_set was reset.
> */
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
> {
> struct vfio_pci_core_device *cur;
> struct pci_dev *pdev;
> int ret;
>
> if (!vfio_pci_dev_set_needs_reset(dev_set))
> - return false;
> + return;
>
> pdev = vfio_pci_dev_set_resettable(dev_set);
> if (!pdev)
> - return false;
> + return;
>
> ret = pci_reset_bus(pdev);
> if (ret)
> - return false;
> + return;
>
> list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> cur->needs_reset = false;
> - if (!disable_idle_d3)
> - vfio_pci_set_power_state(cur, PCI_D3hot);
> + if (!disable_idle_d3) {
> + /*
> + * Inside pci_reset_bus(), all the devices in bus/slot
> + * will be moved out of D0 state. This state change to

s/out of/into/?

> + * D0 can happen directly without going through the
> + * runtime PM framework. pm_runtime_resume() will
> + * help make the runtime state as active and then
> + * pm_request_idle() can be used which will
> + * internally check for device usage count and will
> + * move the device again into the low power state.
> + */
> + pm_runtime_resume(&pdev->dev);
> + pm_request_idle(&pdev->dev);
> + }
> }
> - return true;
> }
>
> +#ifdef CONFIG_PM
> +static int vfio_pci_core_runtime_suspend(struct device *dev)
> +{
> + return 0;
> +}
> +
> +static int vfio_pci_core_runtime_resume(struct device *dev)
> +{
> + return 0;
> +}
> +
> +const struct dev_pm_ops vfio_pci_core_pm_ops = {
> + SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
> + vfio_pci_core_runtime_resume,
> + NULL)
> +};
> +EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);
> +#endif

It looks like the vfio_pci_core_pm_ops implementation should all be
moved to where we implement D3cold support, it's not necessary to
implement stubs for any of the functionality of this patch. Thanks,

Alex

> +
> void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
> bool is_disable_idle_d3)
> {
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index ef9a44b6cf5d..aafe09c9fa64 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
> void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
> void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
>
> +#ifdef CONFIG_PM
> +extern const struct dev_pm_ops vfio_pci_core_pm_ops;
> +#endif
> +
> static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> {
> return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;