Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling

From: Sanath S
Date: Mon Jan 22 2024 - 23:48:49 EST



On 1/23/2024 10:16 AM, Alex Williamson wrote:
On Mon, 22 Jan 2024 15:50:03 -0700
Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:

On Mon, 22 Jan 2024 23:17:30 +0100
Lukas Wunner <lukas@xxxxxxxxx> wrote:

On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:
On Thu, 3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@xxxxxxxxxx wrote:
Testing that a device is not currently in a low power state provides no
guarantees that the device is not immenently transitioning to such a state.
We need to increment the PM usage counter before accessing the device.
Since we don't wish to wake the device for PME polling, do so only if the
device is already active by using pm_runtime_get_if_active().

Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
---
drivers/pci/pci.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
Resurrecting this patch (currently commit d3fcd7360338) for discussion
as it's been identified as the source of a regression in:

https://bugzilla.kernel.org/show_bug.cgi?id=218360

Copying Mika, Lukas, and Rafael as it's related to:

000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")

where we skip devices in D3cold when processing the PME list.

I think the issue in the above bz is that the downstream TB3/USB4 port
is in D3 (presumably D3hot) and I therefore infer the device is in state
RPM_SUSPENDED. This commit is attempting to make sure the device power
state is stable across the call such that it does not transition into
D3cold while we're accessing it.

To do that I used pm_runtime_get_if_active(), but in retrospect this
requires the device to be in RPM_ACTIVE so we end up skipping anything
suspended or transitioning.
How about dropping the calls to pm_runtime_get_if_active() and
pm_runtime_put() and instead simply do:

if (pm_runtime_suspended(&pdev->dev) &&
pdev->current_state != PCI_D3cold)
pci_pme_wakeup(pdev, NULL);
Hi Lukas,

Do we require that the polled device is in the RPM_SUSPENDED state?
Also pm_runtime_suspended() can also only be trusted while holding the
device power.lock, we need a usage count reference to maintain that
state.

I'm also seeing cases where the bridge is power state D0, but PM state
RPM_SUSPENDING, so config space of the polled device becomes
inaccessible even while we're holding a reference once we allow polling
in RPM_SUSPENDED.

I'm currently working with the below patch, which I believe addresses
all these issues, but I'd welcome review and testing. Thanks,

Alex

commit 0a063b8e91d0bc807db712c81c8b270864f99ecb
Author: Alex Williamson <alex.williamson@xxxxxxxxxx>
Date: Tue Jan 16 13:28:33 2024 -0700

PCI: Fix active state requirement in PME polling
The commit noted in fixes added a bogus requirement that runtime PM
managed devices need to be in the RPM_ACTIVE state for PME polling.
In fact, there is no requirement of a specific runtime PM state, it
is only required that the state is stable such that testing config
space availability, ie. !D3cold, remains valid across the PME wakeup.
To that effect, defer polling of runtime PM managed devices that are
not in either the RPM_ACTIVE or RPM_SUSPENDED states. Devices in
transitional states remain on the pci_pme_list and will be re-queued.
However in allowing polling of devices in the RPM_SUSPENDED state,
the bridge state requires further refinement as it's possible to poll
while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING.
An asynchronous completion of the bridge transition to a low power
state can make config space of the subordinate device become
unavailable. A runtime PM reference to the bridge is therefore added
with a supplementary requirement that the bridge is in the RPM_ACTIVE
state.
Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling")
Reported-by: Sanath S <sanath.s@xxxxxxx>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360
Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index bdbf8a94b4d0..31dbf1834b07 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work)
if (pdev->pme_poll) {
struct pci_dev *bridge = pdev->bus->self;
struct device *dev = &pdev->dev;
- int pm_status;
+ struct device *bdev = bridge ? &bridge->dev : NULL;
/*
- * If bridge is in low power state, the
- * configuration space of subordinate devices
- * may be not accessible
+ * If we have a bridge, it should be in an active/D0
+ * state or the configuration space of subordinate
+ * devices may not be accessible.
*/
- if (bridge && bridge->current_state != PCI_D0)
- continue;
+ if (bdev) {
+ spin_lock_irq(&bdev->power.lock);
With the code as shown here I have one system that seems to be getting
contention when reading the vpd sysfs attribute when the endpoints
(QL41000) are bound to vfio-pci and unused, resulting in the root port
and endpoints being suspended. A vpd read can take over a minute.
Seems to be resolved changing the above spin_lock to a spin_trylock:

if (!spin_trylock_irq(&bdev->power.lock))
continue;

The pm_runtime_barrier() as used in the vpd path can be prone to such
issues, I saw similar in the fix I previously proposed in the bz.

I'll continue to do more testing with this change and hopefully Sanath
can verify this resolves the bug report. Thanks,

Alex
I'll verify it today and let you know the observations.
+ if (!pm_runtime_active(bdev) ||
+ bridge->current_state != PCI_D0) {
+ spin_unlock_irq(&bdev->power.lock);
+ continue;
+ }
+ pm_runtime_get_noresume(bdev);
+ spin_unlock_irq(&bdev->power.lock);
+ }
/*
- * If the device is in a low power state it
- * should not be polled either.
+ * The device itself may be either in active or
+ * suspended state, but must not be in D3cold so
+ * that configuration space is accessible. The
+ * transitional resuming and suspending states are
+ * skipped to avoid D3cold races.
*/
- pm_status = pm_runtime_get_if_active(dev, true);
- if (!pm_status)
- continue;
-
- if (pdev->current_state != PCI_D3cold)
+ spin_lock_irq(&dev->power.lock);
+ if ((pm_runtime_active(dev) ||
+ pm_runtime_suspended(dev)) &&
+ pdev->current_state != PCI_D3cold) {
+ pm_runtime_get_noresume(dev);
+ spin_unlock_irq(&dev->power.lock);
pci_pme_wakeup(pdev, NULL);
-
- if (pm_status > 0)
pm_runtime_put(dev);
+ } else {
+ spin_unlock_irq(&dev->power.lock);
+ }
+
+ if (bdev)
+ pm_runtime_put(bdev);
} else {
list_del(&pme_dev->list);
kfree(pme_dev);