Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling

From: Alex Williamson
Date: Thu Jan 18 2024 - 13:51:08 EST


On Thu, 3 Aug 2023 11:12:33 -0600
Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:

> Testing that a device is not currently in a low power state provides no
> guarantees that the device is not imminently transitioning to such a state.
> We need to increment the PM usage counter before accessing the device.
> Since we don't wish to wake the device for PME polling, do so only if the
> device is already active by using pm_runtime_get_if_active().
>
> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
> ---
> drivers/pci/pci.c | 23 ++++++++++++++++-------
> 1 file changed, 16 insertions(+), 7 deletions(-)

Hey folks,

Resurrecting this patch (currently commit d3fcd7360338) for discussion
as it's been identified as the source of a regression in:

https://bugzilla.kernel.org/show_bug.cgi?id=218360

Copying Mika, Lukas, and Rafael as it's related to:

000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")

where we skip devices in D3cold when processing the PME list.

I think the issue in the above bz is that the downstream TB3/USB4 port
is in D3 (presumably D3hot), so I infer that its runtime PM status is
RPM_SUSPENDED. This commit attempts to make sure the device power state
is stable across the call, such that it does not transition into D3cold
while we're accessing it.

To do that I used pm_runtime_get_if_active(), but in retrospect this
requires the device to be in RPM_ACTIVE, so we end up skipping anything
suspended or transitioning, including devices in D3hot that were polled
before this commit.
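
To make the skipped case concrete, here is a rough userspace model of
the check as it stands with this commit. It is only a sketch: struct
model_dev, model_get_if_active() and model_would_poll() are made-up
names standing in for struct device, pm_runtime_get_if_active(dev, true)
and the test in pci_pme_list_scan(), and locking and the D3cold check
are left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Status names follow the kernel's enum rpm_status; this is only a model. */
enum rpm_status { RPM_ACTIVE, RPM_RESUMING, RPM_SUSPENDED, RPM_SUSPENDING };

/* Hypothetical stand-in for the runtime PM state kept in struct device. */
struct model_dev {
	enum rpm_status runtime_status;
	int usage_count;
	bool pm_disabled;
};

/* Models pm_runtime_get_if_active(dev, true): take a reference only if RPM_ACTIVE. */
static inline int model_get_if_active(struct model_dev *dev)
{
	if (dev->pm_disabled)
		return -EINVAL;
	if (dev->runtime_status != RPM_ACTIVE)
		return 0;
	dev->usage_count++;
	return 1;
}

/* Models the test in pci_pme_list_scan(): true if the device would be polled. */
static inline bool model_would_poll(struct model_dev *dev)
{
	int pm_status = model_get_if_active(dev);

	if (!pm_status)
		return false;		/* skipped, RPM_SUSPENDED (e.g. D3hot) included */
	if (pm_status > 0)
		dev->usage_count--;	/* stands in for pm_runtime_put() */
	return true;
}
```

In this model a device in RPM_SUSPENDED is never polled, which matches
what I think we're seeing in the bz.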

As reported in the above bz, I tried replacing this with:

	pm_runtime_get_noresume(dev);
	pm_runtime_barrier(dev);

The theory here being that the barrier would wait for any transitioning
states such that as far as runtime power management is concerned, the
device power state is stable.

Unfortunately this causes livelocks where the barrier never returns.

Instead I'm considering that since we're polling the PME list, maybe we
could just defer devices in transition states, for instance something
that looks like pm_runtime_get_if_active(), but would return zero if
the device was in RPM_SUSPENDING or RPM_RESUMING rather than requiring
RPM_ACTIVE.
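
A rough userspace sketch of that idea, purely for illustration:
model_get_if_not_transitioning() is a made-up name, not an existing
runtime PM interface, and struct model_dev is a stand-in for the
relevant fields of struct device. A real version would presumably need
to test dev->power.runtime_status under dev->power.lock.

```c
#include <errno.h>
#include <stdbool.h>

/* Status names follow the kernel's enum rpm_status; this is only a model. */
enum rpm_status { RPM_ACTIVE, RPM_RESUMING, RPM_SUSPENDED, RPM_SUSPENDING };

/* Hypothetical stand-in for the runtime PM state kept in struct device. */
struct model_dev {
	enum rpm_status runtime_status;
	int usage_count;
	bool pm_disabled;
};

/*
 * Hypothetical helper: take a usage reference unless the device is in the
 * middle of a transition. Returns 1 with the reference held, 0 if the
 * caller should defer the device to a later poll, or -EINVAL if runtime
 * PM is disabled.
 */
static inline int model_get_if_not_transitioning(struct model_dev *dev)
{
	if (dev->pm_disabled)
		return -EINVAL;
	if (dev->runtime_status == RPM_SUSPENDING ||
	    dev->runtime_status == RPM_RESUMING)
		return 0;	/* in transition, try again on the next scan */
	dev->usage_count++;	/* RPM_ACTIVE or RPM_SUSPENDED: hold a reference */
	return 1;
}
```

Unlike pm_runtime_get_if_active(), this would let a device sitting in
RPM_SUSPENDED be polled, and only a device caught mid-transition would
wait for the next pass over the list.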

I'm not an expert in PME or runtime power management though, so I'm
looking for advice. Thanks,

Alex

>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 60230da957e0..bc266f290b2c 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2415,10 +2415,13 @@ static void pci_pme_list_scan(struct work_struct *work)
>
> mutex_lock(&pci_pme_list_mutex);
> list_for_each_entry_safe(pme_dev, n, &pci_pme_list, list) {
> - if (pme_dev->dev->pme_poll) {
> - struct pci_dev *bridge;
> + struct pci_dev *pdev = pme_dev->dev;
> +
> + if (pdev->pme_poll) {
> + struct pci_dev *bridge = pdev->bus->self;
> + struct device *dev = &pdev->dev;
> + int pm_status;
>
> - bridge = pme_dev->dev->bus->self;
> /*
> * If bridge is in low power state, the
> * configuration space of subordinate devices
> @@ -2426,14 +2429,20 @@ static void pci_pme_list_scan(struct work_struct *work)
> */
> if (bridge && bridge->current_state != PCI_D0)
> continue;
> +
> /*
> - * If the device is in D3cold it should not be
> - * polled either.
> + * If the device is in a low power state it
> + * should not be polled either.
> */
> - if (pme_dev->dev->current_state == PCI_D3cold)
> + pm_status = pm_runtime_get_if_active(dev, true);
> + if (!pm_status)
> continue;
>
> - pci_pme_wakeup(pme_dev->dev, NULL);
> + if (pdev->current_state != PCI_D3cold)
> + pci_pme_wakeup(pdev, NULL);
> +
> + if (pm_status > 0)
> + pm_runtime_put(dev);
> } else {
> list_del(&pme_dev->list);
> kfree(pme_dev);