Re: [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework

From: Abhishek Sahu
Date: Mon Feb 21 2022 - 01:35:44 EST


On 2/17/2022 5:18 AM, Alex Williamson wrote:
> On Mon, 24 Jan 2022 23:47:22 +0530
> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>
>> Currently, there is very limited power management support
>> available in the upstream vfio-pci driver. If there is no user of vfio-pci
>> device, then the PCI device will be moved into D3Hot state by writing
>> directly into PCI PM registers. This D3Hot state help in saving power
>> but we can achieve zero power consumption if we go into the D3cold state.
>> The D3cold state cannot be possible with native PCI PM. It requires
>> interaction with platform firmware which is system-specific.
>> To go into low power states (including D3cold), the runtime PM framework
>> can be used which internally interacts with PCI and platform firmware and
>> puts the device into the lowest possible D-States.
>>
>> This patch registers vfio-pci driver with the runtime PM framework.
>>
>> 1. The PCI core framework takes care of most of the runtime PM
>> related things. For enabling the runtime PM, the PCI driver needs to
>> decrement the usage count and needs to register the runtime
>> suspend/resume callbacks. For vfio-pci based driver, these callback
>> routines can be stubbed in this patch since the vfio-pci driver
>> is not doing the PCI device initialization. All the config state
>> saving, and PCI power management related things will be done by
>> PCI core framework itself inside its runtime suspend/resume callbacks.
>>
>> 2. Inside pci_reset_bus(), all the devices in bus/slot will be moved
>> out of D0 state. This state change to D0 can happen directly without
>> going through the runtime PM framework. So if runtime PM is enabled,
>> then pm_runtime_resume() makes the runtime state active. Since the PCI
>> device power state is already D0, so it should return early when it
>> tries to change the state with pci_set_power_state(). Then
>> pm_request_idle() can be used which will internally check for
>> device usage count and will move the device again into the low power
>> state.
>>
>> 3. Inside vfio_pci_core_disable(), the device usage count always needs
>> to be decremented which was incremented in vfio_pci_core_enable().
>>
>> 4. Since the runtime PM framework will provide the same functionality,
>> so directly writing into PCI PM config register can be replaced with
>> the use of runtime PM routines. Also, the use of runtime PM can help
>> us in more power saving.
>>
>> In the systems which do not support D3Cold,
>>
>> With the existing implementation:
>>
>> // PCI device
>> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>> D3hot
>> // upstream bridge
>> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>> D0
>>
>> With runtime PM:
>>
>> // PCI device
>> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>> D3hot
>> // upstream bridge
>> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>> D3hot
>>
>> So, with runtime PM, the upstream bridge or root port will also go
>> into lower power state which is not possible with existing
>> implementation.
>>
>> In the systems which support D3Cold,
>>
>> // PCI device
>> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>> D3hot
>> // upstream bridge
>> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>> D0
>>
>> With runtime PM:
>>
>> // PCI device
>> # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>> D3cold
>> // upstream bridge
>> # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>> D3cold
>>
>> So, with runtime PM, both the PCI device and upstream bridge will
>> go into D3cold state.
>>
>> 5. If 'disable_idle_d3' module parameter is set, then also the runtime
>> PM will be enabled, but in this case, the usage count should not be
>> decremented.
>>
>> 6. vfio_pci_dev_set_try_reset() return value is unused now, so this
>> function return type can be changed to void.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx>
>> ---
>> drivers/vfio/pci/vfio_pci.c | 3 +
>> drivers/vfio/pci/vfio_pci_core.c | 95 +++++++++++++++++++++++---------
>> include/linux/vfio_pci_core.h | 4 ++
>> 3 files changed, 75 insertions(+), 27 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index a5ce92beb655..c8695baf3b54 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver = {
>> .remove = vfio_pci_remove,
>> .sriov_configure = vfio_pci_sriov_configure,
>> .err_handler = &vfio_pci_core_err_handlers,
>> +#if defined(CONFIG_PM)
>> + .driver.pm = &vfio_pci_core_pm_ops,
>> +#endif
>> };
>>
>> static void __init vfio_pci_fill_ids(void)
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index f948e6cd2993..c6e4fe9088c3 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>> }
>>
>> struct vfio_pci_group_info;
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>> static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>> struct vfio_pci_group_info *groups);
>>
>> @@ -245,7 +245,11 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>> u16 cmd;
>> u8 msix_pos;
>>
>> - vfio_pci_set_power_state(vdev, PCI_D0);
>> + if (!disable_idle_d3) {
>> + ret = pm_runtime_resume_and_get(&pdev->dev);
>> + if (ret < 0)
>> + return ret;
>> + }
>
> Sorry for the delay in review, I'm a novice in pm runtime, but I
> haven't forgotten about the remainder of this series.
>

Thanks Alex.
Should I include linux-pm@xxxxxxxxxxxxxxx while sending the updated
version. I got following comment in my different patch related
with PCI PM
(https://lore.kernel.org/lkml/20220204233219.GA228585@bhelgaas/T/#me17cb6e1aa3848cfd4ea577a3c93ebbbfdbf7c73)

"generally PM patches should be CCed to linux-pm anyway"


> I think we're removing the unconditional wake here because we now wake
> the device in the core registration function below, but I think there
> might be a subtle dependency here on the fix to always wake devices in
> the disable function as well, otherwise I'm afraid the power state of a
> device released in D3hot could leak to the next user here.
>

Yes. We need to consider the fix.
Either we can add the state restore handling logic inside
vfio_pci_core_runtime_resume() or we can keep restore the state
alone here explictly. For runtime PM, we need to call
pm_runtime_resume_and_get() first since root port should be moved
to D0 first.

>>
>> /* Don't allow our initial saved state to include busmaster */
>> pci_clear_master(pdev);
>> @@ -405,8 +409,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>> out:
>> pci_disable_device(pdev);
>>
>> - if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
>> - vfio_pci_set_power_state(vdev, PCI_D3hot);
>> + vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
>> +
>> + /* Put the pm-runtime usage counter acquired during enable */
>> + if (!disable_idle_d3)
>> + pm_runtime_put(&pdev->dev);
>> }
>> EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>>
>> @@ -1847,19 +1854,20 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>
>> vfio_pci_probe_power_state(vdev);
>>
>> - if (!disable_idle_d3) {
>> - /*
>> - * pci-core sets the device power state to an unknown value at
>> - * bootup and after being removed from a driver. The only
>> - * transition it allows from this unknown state is to D0, which
>> - * typically happens when a driver calls pci_enable_device().
>> - * We're not ready to enable the device yet, but we do want to
>> - * be able to get to D3. Therefore first do a D0 transition
>> - * before going to D3.
>> - */
>> - vfio_pci_set_power_state(vdev, PCI_D0);
>> - vfio_pci_set_power_state(vdev, PCI_D3hot);
>> - }
>> + /*
>> + * pci-core sets the device power state to an unknown value at
>> + * bootup and after being removed from a driver. The only
>> + * transition it allows from this unknown state is to D0, which
>> + * typically happens when a driver calls pci_enable_device().
>> + * We're not ready to enable the device yet, but we do want to
>> + * be able to get to D3. Therefore first do a D0 transition
>> + * before enabling runtime PM.
>> + */
>> + vfio_pci_set_power_state(vdev, PCI_D0);
>> + pm_runtime_allow(&pdev->dev);
>> +
>> + if (!disable_idle_d3)
>> + pm_runtime_put(&pdev->dev);
>
> I could use some enlightenment here. pm_runtime_allow() only does
> something if power.runtime_allow is false, in which case it sets that
> value to true and decrements power.usage_count. runtime_allow is
> enabled by default in pm_runtime_init(), but pci_pm_init() calls
> pm_runtime_forbid() which does the reverse of pm_runtime_allow(). So
> do I understand correctly that PCI devices are probed with
> runtime_allow = false and a usage_count of 2?
>

Following is the flow w.r.t. usage_count and runtime_allow.

In pci_pm_init(), the default usage_count=0 and runtime_allow=true initially.
pm_runtime_forbid() in pci_pm_init() makes usage_count=1 and runtime_allow=false

Then, inside local_pci_probe(), pm_runtime_get_sync() is called,
After this, the usage_count=2 and runtime_allow=false

So, you are correct that the PCI devices are probed with
runtime_allow=false and usage_count=2.

In the driver,

pm_runtime_allow() is for doing the reverse of pm_runtime_forbid().
and pm_runtime_put() is for doing the reverse of pm_runtime_get_sync().

>>
>> ret = vfio_register_group_dev(&vdev->vdev);
>> if (ret)
>> @@ -1868,7 +1876,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>
>> out_power:
>> if (!disable_idle_d3)
>> - vfio_pci_set_power_state(vdev, PCI_D0);
>> + pm_runtime_get_noresume(&pdev->dev);
>> +
>> + pm_runtime_forbid(&pdev->dev);
>> out_vf:
>> vfio_pci_vf_uninit(vdev);
>> return ret;
>> @@ -1887,7 +1897,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>> vfio_pci_vga_uninit(vdev);
>>
>> if (!disable_idle_d3)
>> - vfio_pci_set_power_state(vdev, PCI_D0);
>> + pm_runtime_get_noresume(&pdev->dev);
>> +
>> + pm_runtime_forbid(&pdev->dev);
>> }
>> EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>>
>> @@ -2093,33 +2105,62 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
>> * - At least one of the affected devices is marked dirty via
>> * needs_reset (such as by lack of FLR support)
>> * Then attempt to perform that bus or slot reset.
>> - * Returns true if the dev_set was reset.
>> */
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>> {
>> struct vfio_pci_core_device *cur;
>> struct pci_dev *pdev;
>> int ret;
>>
>> if (!vfio_pci_dev_set_needs_reset(dev_set))
>> - return false;
>> + return;
>>
>> pdev = vfio_pci_dev_set_resettable(dev_set);
>> if (!pdev)
>> - return false;
>> + return;
>>
>> ret = pci_reset_bus(pdev);
>> if (ret)
>> - return false;
>> + return;
>>
>> list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>> cur->needs_reset = false;
>> - if (!disable_idle_d3)
>> - vfio_pci_set_power_state(cur, PCI_D3hot);
>> + if (!disable_idle_d3) {
>> + /*
>> + * Inside pci_reset_bus(), all the devices in bus/slot
>> + * will be moved out of D0 state. This state change to
>
> s/out of/into/?
>

Yes. I will fix this.

>> + * D0 can happen directly without going through the
>> + * runtime PM framework. pm_runtime_resume() will
>> + * help make the runtime state as active and then
>> + * pm_request_idle() can be used which will
>> + * internally check for device usage count and will
>> + * move the device again into the low power state.
>> + */
>> + pm_runtime_resume(&pdev->dev);
>> + pm_request_idle(&pdev->dev);
>> + }
>> }
>> - return true;
>> }
>>
>> +#ifdef CONFIG_PM
>> +static int vfio_pci_core_runtime_suspend(struct device *dev)
>> +{
>> + return 0;
>> +}
>> +
>> +static int vfio_pci_core_runtime_resume(struct device *dev)
>> +{
>> + return 0;
>> +}
>> +
>> +const struct dev_pm_ops vfio_pci_core_pm_ops = {
>> + SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
>> + vfio_pci_core_runtime_resume,
>> + NULL)
>> +};
>> +EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);
>> +#endif
>
> It looks like the vfio_pci_core_pm_ops implementation should all be
> moved to where we implement D3cold support, it's not necessary to
> implement stubs for any of the functionality of this patch. Thanks,
>

We need to provide dev_pm_ops atleast to make runtime PM working.
In pci_pm_runtime_idle() generic function:

const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
if (!pm)
return -ENOSYS;

Without dev_pm_ops, the idle routine will return ENOSYS error.

vfio_pci_core_runtime_{suspend/resume}() stub implementation can be removed
but we need to provide stub vfio_pci_core_pm_ops atleast.

const struct dev_pm_ops vfio_pci_core_pm_ops = { };
EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);

Thanks,
Abhishek

> Alex
>
>> +
>> void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
>> bool is_disable_idle_d3)
>> {
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index ef9a44b6cf5d..aafe09c9fa64 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
>> void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
>> void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
>>
>> +#ifdef CONFIG_PM
>> +extern const struct dev_pm_ops vfio_pci_core_pm_ops;
>> +#endif
>> +
>> static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>> {
>> return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>