Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling

From: Alex Williamson
Date: Tue Jan 23 2024 - 10:56:17 EST


On Tue, 23 Jan 2024 11:45:19 +0100
Lukas Wunner <lukas@xxxxxxxxx> wrote:

> On Mon, Jan 22, 2024 at 03:50:03PM -0700, Alex Williamson wrote:
> > On Mon, 22 Jan 2024 23:17:30 +0100 Lukas Wunner <lukas@xxxxxxxxx> wrote:
> > > On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:
> > > > To do that I used pm_runtime_get_if_active(), but in retrospect this
> > > > requires the device to be in RPM_ACTIVE so we end up skipping anything
> > > > suspended or transitioning.
> > >
> > > How about dropping the calls to pm_runtime_get_if_active() and
> > > pm_runtime_put() and instead simply do:
> > >
> > > if (pm_runtime_suspended(&pdev->dev) &&
> > > pdev->current_state != PCI_D3cold)
> > > pci_pme_wakeup(pdev, NULL);
> >
> > Do we require that the polled device is in the RPM_SUSPENDED state?
>
> If the device is RPM_SUSPENDING, why immediately resume it for polling?
> It's sufficient to poll it the next time around, i.e. 1 second later.
>
> Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need
> to poll PME.

I'm clearly not an expert on PME, but this is not obvious to me and
before the commit that went in through this thread, PME wakeup was
triggered regardless of the PM state. I was trying to restore the
behavior of not requiring a specific PM state other than deferring
polling across transition states.

> This leaves RPM_SUSPENDED as the only state in which it makes sense to
> poll.
>
> > Also pm_runtime_suspended() can also only be trusted while holding the
> > device power.lock, we need a usage count reference to maintain that
> > state.
>
> Why? Let's say there's a race and the device resumes immediately after
> we call pm_runtime_suspended() here. So we might call pci_pme_wakeup()
> gratuitouly. So what? No biggie.

The issue I'm trying to address is that config space of the device can
become inaccessible while calling pci_pme_wakeup() on it, causing a
system fault on some hardware. So a gratuitous pci_pme_wakeup() can be
detrimental.

We require the device config space to remain accessible, therefore the
instantaneous test against D3cold and that the parent bridge is in D0
is not sufficient. I see traces where the parent bridge is in D0, but
the PM state is RPM_SUSPENDING and the endpoint device transitions to
D3cold while we're executing pci_pme_wakeup().

Therefore at a minimum, I think we need to enforce that the bridge is
in RPM_ACTIVE and remains in that state across pci_pme_wakeup(), which
means we need to hold a usage count reference, and that usage count
reference must be acquired under power.lock in RPM_ACTIVE state to be
effective.

> > + if (bdev) {
> > + spin_lock_irq(&bdev->power.lock);
>
> Hm, I'd expect that lock to be internal to the PM core,
> although there *are* a few stray users outside of it.

Right, there are. It's possible that if we only need to hold a
reference on the bridge we can abstract this through
pm_runtime_get_if_active(), the semantics worked better to essentially
open code it in this iteration though. Thanks,

Alex