Re: [PATCH] PCI: pciehp: Prevent child devices from doing RPM on PCIe Link Down

From: Lukas Wunner
Date: Wed Oct 18 2023 - 05:44:43 EST


[cc -= unrelated mailing lists bpf, kernel-hardening]

On Tue, Oct 17, 2023 at 10:25:39AM +0000, Ricky WU wrote:
> > On Mon, Oct 16, 2023 at 12:01:31PM +0800, Kai-Heng Feng wrote:
> > > When inserting an SD7.0 card to Realtek card reader, it can trigger
> > > PCI slot Link down and causes the following error:
> >
> > Why does *inserting* a card cause a Link Down?
> >
> >
> > > [ 63.898861] pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down
> > > [ 63.912118] BUG: unable to handle page fault for address:
> > ffffb24d403e5010
> > [...]
> > > [ 63.912198] ? asm_exc_page_fault+0x27/0x30
> > > [ 63.912203] ? ioread32+0x2e/0x70
> > > [ 63.912206] ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci]
> > > [ 63.912217] rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci]
> > > [ 63.912226] rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci]
> > > [ 63.912234] rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci]
> > > [ 63.912243] ? __pfx_pci_pm_runtime_idle+0x10/0x10
> > > [ 63.912246] pci_pm_runtime_idle+0x34/0x70
> > > [ 63.912248] rpm_idle+0xc4/0x2b0
> > > [ 63.912251] pm_runtime_work+0x93/0xc0
> > > [ 63.912254] process_one_work+0x21a/0x430
> > > [ 63.912258] worker_thread+0x4a/0x3c0
> >
> > This looks like pcr->remap_addr is accessed after it has been iounmap'ed in
> > rtsx_pci_remove() or before it has been iomap'ed in rtsx_pci_probe().
> >
> > Is the card reader itself located below a hotplug port and unplugged here?
>
> Yes it is card reader unplug itself, because sd7.0 card is not used
> rtsx_pcr, it use nvme driver

I can only guess here as the dmesg and lspci output I requested
hasn't been provided:

I assume that this card reader may contain a PCIe switch with a
regular SD card reader below a first Downstream Port (which is
hotplug-capable). If an SD express card is inserted, the regular
SD card reader disappears from the Downstream Port and the
inserted card appears as an NVMe drive (possibly below a
second Downstream Port, or below the first Downstream Port).

The commit message is somewhat misleading in that it links the
unhandled page fault to card insertion. That may be the trigger,
but the root cause appears to be that a runtime PM request is
performed asynchronously after the SD card reader has iounmap'ed
pcr->remap_addr.

If that is indeed the root cause (as I suspect), the fix needs to
be placed somewhere else.

pciehp is only one of several hotplug drivers and fixing this only
in pciehp may leave the other hotplug drivers vulnerable to the same
issue.

Unfortunately the information provided so far is insufficient to
recommend a better fix. It would be necessary to debug why the
asynchronous RPM request is not canceled even though rtsx_pci_remove()
runtime resumes the device before iounmap'ing pcr->remap_addr.
Perhaps there's a bug in runtime PM code that causes asynchronous
requests to not be canceled upon a pm_runtime_get_sync()?
Or perhaps a wmb() is necessary in pci_device_remove() after setting
pci_dev->driver = NULL?

Thanks,

Lukas