Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling

From: Alex Williamson
Date: Mon Jan 22 2024 - 17:50:21 EST


On Mon, 22 Jan 2024 23:17:30 +0100
Lukas Wunner <lukas@xxxxxxxxx> wrote:

> On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:
> > On Thu, 3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@xxxxxxxxxx wrote:
> > > Testing that a device is not currently in a low power state provides no
> > > guarantees that the device is not immenently transitioning to such a state.
> > > We need to increment the PM usage counter before accessing the device.
> > > Since we don't wish to wake the device for PME polling, do so only if the
> > > device is already active by using pm_runtime_get_if_active().
> > >
> > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > ---
> > > drivers/pci/pci.c | 23 ++++++++++++++++-------
> > > 1 file changed, 16 insertions(+), 7 deletions(-)
> >
> > Resurrecting this patch (currently commit d3fcd7360338) for discussion
> > as it's been identified as the source of a regression in:
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=218360
> >
> > Copying Mika, Lukas, and Rafael as it's related to:
> >
> > 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")
> >
> > where we skip devices in D3cold when processing the PME list.
> >
> > I think the issue in the above bz is that the downstream TB3/USB4 port
> > is in D3 (presumably D3hot) and I therefore infer the device is in state
> > RPM_SUSPENDED. This commit is attempting to make sure the device power
> > state is stable across the call such that it does not transition into
> > D3cold while we're accessing it.
> >
> > To do that I used pm_runtime_get_if_active(), but in retrospect this
> > requires the device to be in RPM_ACTIVE so we end up skipping anything
> > suspended or transitioning.
>
> How about dropping the calls to pm_runtime_get_if_active() and
> pm_runtime_put() and instead simply do:
>
> if (pm_runtime_suspended(&pdev->dev) &&
> pdev->current_state != PCI_D3cold)
> pci_pme_wakeup(pdev, NULL);

Hi Lukas,

Do we require that the polled device is in the RPM_SUSPENDED state?
Also pm_runtime_suspended() can also only be trusted while holding the
device power.lock, we need a usage count reference to maintain that
state.

I'm also seeing cases where the bridge is power state D0, but PM state
RPM_SUSPENDING, so config space of the polled device becomes
inaccessible even while we're holding a reference once we allow polling
in RPM_SUSPENDED.

I'm currently working with the below patch, which I believe addresses
all these issues, but I'd welcome review and testing. Thanks,

Alex

commit 0a063b8e91d0bc807db712c81c8b270864f99ecb
Author: Alex Williamson <alex.williamson@xxxxxxxxxx>
Date: Tue Jan 16 13:28:33 2024 -0700

PCI: Fix active state requirement in PME polling

The commit noted in fixes added a bogus requirement that runtime PM
managed devices need to be in the RPM_ACTIVE state for PME polling.
In fact, there is no requirement of a specific runtime PM state, it
is only required that the state is stable such that testing config
space availability, ie. !D3cold, remains valid across the PME wakeup.

To that effect, defer polling of runtime PM managed devices that are
not in either the RPM_ACTIVE or RPM_SUSPENDED states. Devices in
transitional states remain on the pci_pme_list and will be re-queued.

However in allowing polling of devices in the RPM_SUSPENDED state,
the bridge state requires further refinement as it's possible to poll
while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING.
An asynchronous completion of the bridge transition to a low power
state can make config space of the subordinate device become
unavailable. A runtime PM reference to the bridge is therefore added
with a supplementary requirement that the bridge is in the RPM_ACTIVE
state.

Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling")
Reported-by: Sanath S <sanath.s@xxxxxxx>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360
Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index bdbf8a94b4d0..31dbf1834b07 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work)
if (pdev->pme_poll) {
struct pci_dev *bridge = pdev->bus->self;
struct device *dev = &pdev->dev;
- int pm_status;
+ struct device *bdev = bridge ? &bridge->dev : NULL;

/*
- * If bridge is in low power state, the
- * configuration space of subordinate devices
- * may be not accessible
+ * If we have a bridge, it should be in an active/D0
+ * state or the configuration space of subordinate
+ * devices may not be accessible.
*/
- if (bridge && bridge->current_state != PCI_D0)
- continue;
+ if (bdev) {
+ spin_lock_irq(&bdev->power.lock);
+ if (!pm_runtime_active(bdev) ||
+ bridge->current_state != PCI_D0) {
+ spin_unlock_irq(&bdev->power.lock);
+ continue;
+ }
+ pm_runtime_get_noresume(bdev);
+ spin_unlock_irq(&bdev->power.lock);
+ }

/*
- * If the device is in a low power state it
- * should not be polled either.
+ * The device itself may be either in active or
+ * suspended state, but must not be in D3cold so
+ * that configuration space is accessible. The
+ * transitional resuming and suspending states are
+ * skipped to avoid D3cold races.
*/
- pm_status = pm_runtime_get_if_active(dev, true);
- if (!pm_status)
- continue;
-
- if (pdev->current_state != PCI_D3cold)
+ spin_lock_irq(&dev->power.lock);
+ if ((pm_runtime_active(dev) ||
+ pm_runtime_suspended(dev)) &&
+ pdev->current_state != PCI_D3cold) {
+ pm_runtime_get_noresume(dev);
+ spin_unlock_irq(&dev->power.lock);
pci_pme_wakeup(pdev, NULL);
-
- if (pm_status > 0)
pm_runtime_put(dev);
+ } else {
+ spin_unlock_irq(&dev->power.lock);
+ }
+
+ if (bdev)
+ pm_runtime_put(bdev);
} else {
list_del(&pme_dev->list);
kfree(pme_dev);