Re: [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state

From: Abhishek Sahu
Date: Mon Feb 21 2022 - 03:12:37 EST


On 2/18/2022 4:44 AM, Alex Williamson wrote:
> On Mon, 24 Jan 2022 23:47:25 +0530
> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>
>> According to [PCIe v5 5.3.1.4.1] for D3hot state
>>
>> "Configuration and Message requests are the only TLPs accepted by a
>> Function in the D3Hot state. All other received Requests must be
>> handled as Unsupported Requests, and all received Completions may
>> optionally be handled as Unexpected Completions."
>>
>> Currently, if the vfio PCI device has been put into D3hot state and if
>> user makes non-config related read/write request in D3hot state, these
>> requests will be forwarded to the host and this access may cause
>> issues on a few systems.
>>
>> This patch leverages the memory-disable support added in commit
>> 'abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on
>> disabled memory")' to generate page fault on mmap access and
>> return error for the direct read/write. If the device is D3hot state,
>> then the error needs to be returned for all kinds of BAR
>> related access (memory, IO and ROM). Also, the power related structure
>> fields need to be protected so we can use the same 'memory_lock' to
>> protect these fields also. For the few cases, this 'memory_lock' will be
>> already acquired by callers so introduce a separate function
>> vfio_pci_set_power_state_locked(). The original
>> vfio_pci_set_power_state() now contains the code to do the locking
>> related operations.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx>
>> ---
>> drivers/vfio/pci/vfio_pci_core.c | 47 +++++++++++++++++++++++++-------
>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++----
>> 2 files changed, 51 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index ee2fb8af57fa..38440d48973f 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -201,11 +201,12 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>> }
>>
>> /*
>> - * pci_set_power_state() wrapper handling devices which perform a soft reset on
>> - * D3->D0 transition. Save state prior to D0/1/2->D3, stash it on the vdev,
>> - * restore when returned to D0. Saved separately from pci_saved_state for use
>> - * by PM capability emulation and separately from pci_dev internal saved state
>> - * to avoid it being overwritten and consumed around other resets.
>> + * vfio_pci_set_power_state_locked() wrapper handling devices which perform a
>> + * soft reset on D3->D0 transition. Save state prior to D0/1/2->D3, stash it
>> + * on the vdev, restore when returned to D0. Saved separately from
>> + * pci_saved_state for use by PM capability emulation and separately from
>> + * pci_dev internal saved state to avoid it being overwritten and consumed
>> + * around other resets.
>> *
>> * There are few cases where the PCI power state can be changed to D0
>> * without the involvement of this API. So, cache the power state locally
>> @@ -215,7 +216,8 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>> * The memory taken for saving this PCI state needs to be freed to
>> * prevent memory leak.
>> */
>> -int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
>> +static int vfio_pci_set_power_state_locked(struct vfio_pci_core_device *vdev,
>> + pci_power_t state)
>> {
>> struct pci_dev *pdev = vdev->pdev;
>> bool needs_restore = false, needs_save = false;
>> @@ -260,6 +262,26 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>> return ret;
>> }
>>
>> +/*
>> + * vfio_pci_set_power_state() takes all the required locks to protect
>> + * the access of power related variables and then invokes
>> + * vfio_pci_set_power_state_locked().
>> + */
>> +int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev,
>> + pci_power_t state)
>> +{
>> + int ret;
>> +
>> + if (state >= PCI_D3hot)
>> + vfio_pci_zap_and_down_write_memory_lock(vdev);
>> + else
>> + down_write(&vdev->memory_lock);
>> +
>> + ret = vfio_pci_set_power_state_locked(vdev, state);
>> + up_write(&vdev->memory_lock);
>> + return ret;
>> +}
>> +
>> int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>> {
>> struct pci_dev *pdev = vdev->pdev;
>> @@ -354,7 +376,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>> * in running the logic needed for D0 power state. The subsequent
>> * runtime PM API's will put the device into the low power state again.
>> */
>> - vfio_pci_set_power_state(vdev, PCI_D0);
>> + vfio_pci_set_power_state_locked(vdev, PCI_D0);
>>
>> /* Stop the device from further DMA */
>> pci_clear_master(pdev);
>> @@ -967,7 +989,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>> * interaction. Update the power state in vfio driver to perform
>> * the logic needed for D0 power state.
>> */
>> - vfio_pci_set_power_state(vdev, PCI_D0);
>> + vfio_pci_set_power_state_locked(vdev, PCI_D0);
>> up_write(&vdev->memory_lock);
>>
>> return ret;
>> @@ -1453,6 +1475,11 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>> goto up_out;
>> }
>>
>> + if (vdev->power_state >= PCI_D3hot) {
>> + ret = VM_FAULT_SIGBUS;
>> + goto up_out;
>> + }
>> +
>> /*
>> * We populate the whole vma on fault, so we need to test whether
>> * the vma has already been mapped, such as for concurrent faults
>> @@ -1902,7 +1929,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>> * be able to get to D3. Therefore first do a D0 transition
>> * before enabling runtime PM.
>> */
>> - vfio_pci_set_power_state(vdev, PCI_D0);
>> + vfio_pci_set_power_state_locked(vdev, PCI_D0);
>> pm_runtime_allow(&pdev->dev);
>>
>> if (!disable_idle_d3)
>> @@ -2117,7 +2144,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>> * interaction. Update the power state in vfio driver to perform
>> * the logic needed for D0 power state.
>> */
>> - vfio_pci_set_power_state(cur, PCI_D0);
>> + vfio_pci_set_power_state_locked(cur, PCI_D0);
>> if (cur == cur_mem)
>> is_mem = false;
>> if (cur == cur_vma)
>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>> index 57d3b2cbbd8e..e97ba14c4aa0 100644
>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>> @@ -41,8 +41,13 @@
>> static int vfio_pci_iowrite##size(struct vfio_pci_core_device *vdev, \
>> bool test_mem, u##size val, void __iomem *io) \
>> { \
>> + down_read(&vdev->memory_lock); \
>> + if (vdev->power_state >= PCI_D3hot) { \
>> + up_read(&vdev->memory_lock); \
>> + return -EIO; \
>> + } \
>> + \
>
> The reason that we only set test_mem for MMIO BARs is that systems are
> generally more lenient about probing unresponsive I/O port space to
> support legacy use cases. Have you found cases where access to an I/O
> port BAR when the device is either in D3hot+ or I/O port is disabled in
> the command register triggers a system fault? If not it seems we could
> roll the power_state check into __vfio_pci_memory_enabled(), if so then
> we probably need to improve our coverage of access to disabled I/O port
> BARs beyond only the power_state check. Thanks,
>
> Alex
>

I have not seen any system unresponsive in the systems which I am using
for testing these patches. If I try to access MMIO BAR or IO port while
the device is in D3hot+, then I am getting all 0xff. Since I was not
sure regarding the behaviour in other systems while the device is in
D3hot+, so I did power_state check outside.

We can start with power_state check under __vfio_pci_memory_enabled()
and improve coverage later-on if any issue arises.

Thanks,
Abhishek

>> if (test_mem) { \
>> - down_read(&vdev->memory_lock); \
>> if (!__vfio_pci_memory_enabled(vdev)) { \
>> up_read(&vdev->memory_lock); \
>> return -EIO; \
>> @@ -51,8 +56,7 @@ static int vfio_pci_iowrite##size(struct vfio_pci_core_device *vdev, \
>> \
>> vfio_iowrite##size(val, io); \
>> \
>> - if (test_mem) \
>> - up_read(&vdev->memory_lock); \
>> + up_read(&vdev->memory_lock); \
>> \
>> return 0; \
>> }
>> @@ -68,8 +72,13 @@ VFIO_IOWRITE(64)
>> static int vfio_pci_ioread##size(struct vfio_pci_core_device *vdev, \
>> bool test_mem, u##size *val, void __iomem *io) \
>> { \
>> + down_read(&vdev->memory_lock); \
>> + if (vdev->power_state >= PCI_D3hot) { \
>> + up_read(&vdev->memory_lock); \
>> + return -EIO; \
>> + } \
>> + \
>> if (test_mem) { \
>> - down_read(&vdev->memory_lock); \
>> if (!__vfio_pci_memory_enabled(vdev)) { \
>> up_read(&vdev->memory_lock); \
>> return -EIO; \
>> @@ -78,8 +87,7 @@ static int vfio_pci_ioread##size(struct vfio_pci_core_device *vdev, \
>> \
>> *val = vfio_ioread##size(io); \
>> \
>> - if (test_mem) \
>> - up_read(&vdev->memory_lock); \
>> + up_read(&vdev->memory_lock); \
>> \
>> return 0; \
>> }
>