Re: Kernel panic while doing vfio-pci hot-plug/unplug test

From: Bjorn Helgaas
Date: Thu Oct 24 2019 - 18:32:04 EST


On Thu, Oct 24, 2019 at 04:22:32AM -0700, Matthew Wilcox wrote:
> On Wed, Oct 23, 2019 at 03:46:38PM -0500, Bjorn Helgaas wrote:
> > [+cc Thomas, Rafael, beginning of thread at
> > https://lore.kernel.org/r/79827f2f-9b43-4411-1376-b9063b67aee3@xxxxxxxxxx]
> >
> > On Wed, Oct 23, 2019 at 09:38:51AM -0700, Matthew Wilcox wrote:
> > > On Wed, Oct 23, 2019 at 10:15:40AM -0500, Bjorn Helgaas wrote:
> > > > I don't like being one of a handful of callers of __add_wait_queue(),
> > > > so I like that solution from that point of view.
> > > >
> > > > The 7ea7e98fd8d0 ("PCI: Block on access to temporarily unavailable pci
> > > > device") commit log suggests that using __add_wait_queue() is a
> > > > significant optimization, but I don't know how important that is in
> > > > practical terms. Config accesses are never a performance path anyway,
> > > > so I'd be inclined to use add_wait_queue() unless somebody complains.
> > >
> > > Wow, this has got pretty messy in the umpteen years since I last looked
> > > at it.
> > >
> > > Some problems I see:
> > >
> > > 1. Commit df65c1bcd9b7b639177a5a15da1b8dc3bee4f5fa (tglx) says:
> > >
> > > x86/PCI: Select CONFIG_PCI_LOCKLESS_CONFIG
> > >
> > > All x86 PCI configuration space accessors have either their own
> > > serialization or can operate completely lockless (ECAM).
> > >
> > > Disable the global lock in the generic PCI configuration space accessors.
> > >
> > > The concept behind this patch is broken. We still need to lock out
> > > config space accesses when devices are undergoing D-state transitions.
> > > I would suggest that for the contention case that tglx is concerned about,
> > > we should have a pci_bus_read_config_unlocked_##size set of functions
> > > which can be used for devices we know never go into D states.
> >
> > Host bridges that can't do config accesses atomically, e.g., they have
> > something like the 0xcf8/0xcfc addr/data ports, need serialization.
> > CONFIG_PCI_LOCKLESS_CONFIG removes the use of pci_lock for that, and I
> > think that part makes sense regardless of whether devices can enter D
> > states.
>
> I disagree. If a device is in D state, we need to block the access.
> Maybe there needs to be a different mechanism for doing it that's not
> a machine-wide lock, but it needs to happen.

I'm missing something here. If we're using 0xcf8/0xcfc, we need to
serialize so the address in 0xcf8 is preserved until we do the load or
store to 0xcfc, and x86 does that with pci_config_lock. If we're
using ECAM, we don't need that serialization.

We might need serialization for some other reason, but not because of
that host bridge config access mechanism.

Config accesses are supposed to work fine in all D states except
D3cold, as long as any upstream bridges are in D0. At least, I think
that's what the spec says.

> > We *should* prevent config accesses during D-state transitions (per
> > PCIe r5.0, sec 5.9), but I don't think pci_lock ever did that.
>
> It used to set block_ucfg_access. Maybe that's been lost; I see
> there are still calls to pci_dev_lock() in pci_reset_function(),
> for example.

You have an incredible memory! pci_ucfg_wait was renamed to
pci_cfg_wait in 2011 with fb51ccbf217c ("PCI: Rework config space
blocking services").

But even then, I don't see a connection with D-state transitions,
e.g., there's nothing in pci_raw_set_power_state() that calls
pci_dev_lock() or otherwise sets block_cfg_wait.