Re: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state

From: Abhishek Sahu
Date: Tue Nov 23 2021 - 02:27:28 EST


On 11/19/2021 9:15 PM, Alex Williamson wrote:
> On Thu, 18 Nov 2021 14:09:13 -0700
> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
>
>> On Thu, 18 Nov 2021 20:51:41 +0530
>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>>> On 11/17/2021 11:23 PM, Alex Williamson wrote:
>>> Thanks Alex for checking this series and providing your inputs.
>>>
>>>> If we're transitioning a device to D3cold rather than D3hot as
>>>> requested by userspace, isn't that a user visible change?
>>>
>>> For most drivers in the Linux kernel, the choice between D3hot
>>> and D3cold is made at the PCI core layer. There,
>>> pci_target_state() determines which D3 state to choose: it checks
>>> platform_pci_power_manageable() and then calls
>>> platform_pci_choose_state() to find the target state.
>>> In a VM, the platform_pci_power_manageable() check will fail if
>>> the guest is a Linux OS, so it uses the D3hot state.
>>
>> Right, but my statement is really more that the device PM registers
>> cannot be used to put the device into D3cold, so the write of the PM
>> register that we're trapping was the user/guest's intention to put the
>> device into D3hot. We therefore need to be careful about differences
>> in the resulting device state when it comes out of D3cold vs D3hot.
>>> But there are a few drivers that do not use the PCI framework's
>>> generic power management routines during runtime/system suspend
>>> and instead set the PCI power state directly to D3hot.
>>
>> Current vfio-pci being one of those ;)
>>
>>> Also, the guest can be a non-Linux OS, and in that case it is
>>> difficult to know the behavior. So it may impact those cases.
>>
>> That's what I'm worried about.
>>
>>>> For instance, a device may report NoSoftRst- indicating that the device
>>>> does not do a soft reset on D3hot->D0 transition. If we're instead
>>>> putting the device in D3cold, then a transition back to D0 has very
>>>> much undergone a reset. On one hand we should at least virtualize the
>>>> NoSoftRst bit to allow the guest to restore the device, but I wonder if
>>>> that's really safe. Is a better option to prevent entering D3cold if
>>>> the device isn't natively reporting NoSoftRst-?
>>>>
>>>
>>> You mean to say NoSoftRst+ instead of NoSoftRst- as visible in
>>
>> Oops yes. The concern is if the user/guest is not expecting a soft
>> reset when using D3hot, but we transparently promote D3hot to D3cold,
>> which always implies a device reset.
>>
>>> the lspci output. For the NoSoftRst- case, the device does a soft
>>> reset on the D3hot->D0 transition. But will this case not be handled
>>> internally in drivers/pci/pci-driver.c? For both system suspend and
>>> runtime suspend, we check the pci_dev->state_saved flag and do
>>> pci_save_state() irrespective of the NoSoftRst bit. For the
>>> NoSoftRst- case, pci_restore_bars() will be called in
>>> pci_raw_set_power_state(), which reinitializes the device for the
>>> D3hot/D3cold->D0 transition. Once the device is initialized in the
>>> host, it should work for the guest without being re-initialized
>>> again on the guest side. I am not sure if my understanding is
>>> correct.
>>
>> The soft reset is not limited to the state that the PCI subsystem can
>> save and restore. Device specific state that the user/guest may
>> legitimately expect to be retained may be reset as well.
>>
>> [PCIe v5 5.3.1.4]
>> Functional context is required to be maintained by Functions in
>> the D3 hot state if the No_Soft_Reset field in the PMCSR is Set.
>>
>> Unfortunately I don't see a specific definition of "functional
>> context", but I interpret that to include device specific state. For
>> example, if a GPU contains specific frame buffer data and reports
>> NoSoftRst+, wouldn't it be reasonable to expect that framebuffer data
>> to be retained on D3hot->D0 transition?
>>

Thanks, Alex, for your inputs.

I got your point. Yes, that can happen.
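
Just to put the idea into code, a minimal sketch of such a check
(the helper name and where it would be hooked in are my assumptions,
not part of this series) could allow the D3hot->D3cold promotion only
when the device does not advertise No_Soft_Reset:

#include <linux/pci.h>

/*
 * Return true only when the device already performs a soft reset on
 * the D3hot->D0 transition (NoSoftRst-), so replacing D3hot with
 * D3cold does not break the guest's expectation of retained
 * functional context.
 */
static bool vfio_pci_may_use_d3cold(struct pci_dev *pdev)
{
	u16 pmcsr;

	if (!pdev->pm_cap)
		return false;

	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);

	return !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
}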

>>>> We're also essentially making a policy decision on behalf of
>>>> userspace that favors power saving over latency. Is that
>>>> universally the correct trade-off?
>>>
>>> For most drivers, the choice between D3hot and D3cold is not
>>> driven by latency concerns. On the Linux kernel side, I see that
>>> the PCI framework tries to use the D3cold state if the platform
>>> and device support it. But it is correct that covertly replacing
>>> D3hot with D3cold may be a concern for some drivers.
>>>
>>>> I can imagine this could be desirable for many use cases,
>>>> but if we're going to covertly replace D3hot with D3cold, it seems
>>>> like there should be an opt-in. Is co-opting the PM capability for
>>>> this even really acceptable or should there be a device ioctl to
>>>> request D3cold and plumbing through QEMU such that a VM guest can
>>>> make informed choices regarding device power management?
>>>>
>>>
>>> Adding an IOCTL is also an option, but in that case this support
>>> needs to be added in all hypervisors and the user must pass this
>>> information explicitly for each device. Another option could be a
>>> module parameter to explicitly enable D3cold support. If the module
>>> parameter is not set, then we can call pci_d3cold_disable() and, in
>>> that case, runtime PM should not use the D3cold state.
>>>
>>> Also, I was checking whether we can pass this information through
>>> some virtualized register bit defined only for passing the
>>> information between guest and host. On the guest side, if the
>>> target state is decided with pci_target_state(), then D3cold vs.
>>> D3hot should not matter to the driver running in the guest; in that
>>> case it depends upon platform support and we can set this
>>> virtualized bit to 1. But if the driver either sets the D3hot state
>>> explicitly or has called pci_d3cold_disable() (or a similar API
>>> available in the guest OS), then this bit is set to 0 and the
>>> D3cold state can be disabled on the host side. But I don't know if
>>> it is possible to use some non-PCI-defined virtualized register
>>> bit.
>>
>> If you're suggesting a device config space register, that's troublesome
>> because we can't guarantee that simply because a range of config space
>> isn't within a capability that it doesn't have some device specific
>> purpose. However, we could certainly implement virtual registers in
>> the hypervisor that implement the ACPI support that an OS would use on
>> bare metal to implement D3cold. Those could trigger this ioctl through
>> the vfio device.
>>

Yes. I was suggesting some bits in the PM_CTRL register.
We can identify bits which are unused, like bits 22 and 23:

23:22 Undefined - these bits were defined in previous specifications.
They should be ignored by software.

and virtualize these bits for passing D3cold related information
from the guest to the host. These bits were earlier defined for
PCI-to-PCI bridges. But this would be a hack and would cause issues
if the PCI specification starts using these bits in future revisions.
Also, it would need changes in other OSes. So it won't be a good
idea.

>>> I am not sure what the best option is for making the choice
>>> regarding D3cold, but if we have some option by which this
>>> can be done without user involvement, then it will benefit
>>> a lot of cases. Currently, D3cold is supported only on
>>> very few desktops/servers, but in the future we will see it
>>> on most platforms.
>>
>> I tend to see it as an interesting hack to promote D3hot to D3cold, and
>> potentially very useful. However, we're also introducing potentially
>> unexpected device behavior, so I think it would probably need to be an
>> opt-in. Possibly if the device reports NoSoftRst- we could use it by
>> default, but even virtualizing the NoSoftRst suggests that there's an
>> expectation that the guest driver has that support available.
>>

Sure. Based on the discussion, the best option is to provide an
IOCTL to the user for the transition to D3cold. The hypervisor can
implement the required ACPI power resources for D3hot and D0, and
once the guest OS calls these power resource methods, it can invoke
the IOCTL to the host-side vfio-pci driver.

The D0/D1/D2/D3hot transitions can happen the existing way, by
writing directly into the PM config register, and the IOCTL handles
the transition from D3hot to D3cold via the runtime PM framework.
A rough sketch of the host-side handlers follows below.
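
The function names below and the exact IOCTL interface are only
placeholders I am using for illustration, not a final design:

#include <linux/pm_runtime.h>
#include <linux/vfio_pci_core.h>

/*
 * The guest has already written D3hot through the virtualized PM
 * register; drop our runtime PM reference synchronously so that the
 * PCI core can move the device further into D3cold if the platform
 * supports it.
 */
static int vfio_pci_low_power_entry(struct vfio_pci_core_device *vdev)
{
	return pm_runtime_put_sync_suspend(&vdev->pdev->dev);
}

/* Bring the device back to D0 before the guest touches it again. */
static int vfio_pci_low_power_exit(struct vfio_pci_core_device *vdev)
{
	return pm_runtime_resume_and_get(&vdev->pdev->dev);
}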

I will do more analysis on this by making the code changes
and will update.

>>>> Also if the device is not responsive to config space due to the user
>>>> placing it in D3 now, I'd expect there are other ioctl paths that
>>>> need to be blocked, maybe even MMIO paths that might be a gap for
>>>> existing D3hot support. Thanks,
>>>
>>> I was under the assumption that most of the IOCTL code will be
>>> called by the hypervisor before the guest OS boots, and during that
>>> time the device will always be in D0. But if there are paths where
>>> an IOCTL can be called after the device has been suspended by the
>>> guest OS, then can we use the runtime get/put APIs there as well?
>>
>> It's more a matter of preventing user actions that can cause harm
>> rather than expecting certain operations only in specific states. We
>> could choose to either resume the device for those operations or fail
>> the operation. We should probably also leverage the memory-disable
>> support to fault mmap access to MMIO when the device is in D3* as well.
>

Sure. I can explore this as well.
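
As a first thought (only a sketch, with a made-up function name, not
something from this series), the other IOCTL paths could wrap the
operation with a resume/put pair, or fail it if the resume fails:

#include <linux/pm_runtime.h>
#include <linux/vfio_pci_core.h>

static long vfio_pci_some_ioctl(struct vfio_pci_core_device *vdev,
				unsigned long arg)
{
	int ret;

	/* Wake the device up before touching config space or MMIO. */
	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
	if (ret < 0)
		return ret;

	/* ... existing ioctl handling would go here ... */
	ret = 0;

	/* Let the device drop back to its low power state. */
	pm_runtime_put(&vdev->pdev->dev);

	return ret;
}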

> It also occurred to me last night that a guest triggering D3hot via the
> PM registers must be a synchronous power state change; we can't use
> auto-suspend. This is necessary for nested assignment, where the guest
> might use a D3hot->D0 power state transition with NoSoftRst- devices in
> order to perform a reset of the device. With auto-suspend, the guest
> would return the device to D0 before the physical device ever timed out
> to enter a D3 state. Thanks,
>
> Alex
>

Okay. I think if we use the IOCTL-based way to trigger D3cold, then
autosuspend should not be required. Also, in that case, the transition
to D3hot can happen the existing way, by writing directly into the PCI
PM register. I will validate this use case after making the changes
for the IOCTL-based design. I will make the changes in QEMU locally
to validate them.
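
On the QEMU side, the change I have in mind is roughly the following
(the IOCTL name and request number are placeholders only, to show the
flow I want to validate once the real interface is defined): when the
guest powers off the emulated ACPI power resource, QEMU asks the host
vfio-pci driver to complete the move to D3cold.

#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <errno.h>

/* Placeholder request number purely for illustration; the real
 * definition would live in the vfio UAPI header. */
#define VFIO_DEVICE_LOW_POWER_ENTRY _IO(';', 200)

static int vfio_pci_enter_d3cold(int device_fd)
{
    /* The guest has already written D3hot via the PM register; ask
     * the host vfio-pci driver to go the rest of the way to D3cold. */
    if (ioctl(device_fd, VFIO_DEVICE_LOW_POWER_ENTRY) < 0) {
        return -errno;
    }

    return 0;
}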

Thanks,
Abhishek