On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register,

Well, users could just *unplug* the device, right?  Why is it relevant
that they could fiddle with registers in config space?

> as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off. thus cause an IOMMU devTLB flush request for device to
> be sent and a long time completion/timeout waiting in interrupt context.

A completion timeout should be on the order of usecs or msecs, why does it
cause a hard lockup? The dmesg excerpt you've provided shows a 12 *second*
delay between hot removal and watchdog reaction.

> Fix it by checking the device's error_state in
> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
> request to link down device that is set to pci_channel_io_perm_failure and
> then powered off in

This doesn't seem to be a proper fix. It will work most of the time
but not always. A user might bring down the slot via sysfs, then yank
the card from the slot just when the iommu flush occurs such that the
pci_dev_is_disconnected(pdev) check returns false but the card is
physically gone immediately afterwards. In other words, you've shrunk
the time window during which the issue may occur, but haven't eliminated
it completely.
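
To illustrate the window I mean, here is a userspace toy model.  It is
not kernel code: struct fake_dev, flush_devtlb(), yank_card() and
simulate_flush() are all made up for this sketch, and the "hook" merely
stands in for a card being pulled between the check and the flush
request.  It only assumes that a request sent to an absent device ends
in a timeout.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hot-pluggable device, not a kernel struct. */
struct fake_dev {
	bool disconnected;		/* what software believes */
	bool physically_present;	/* what is really in the slot */
};

enum flush_result {
	FLUSH_SKIPPED,		/* check saw the device as disconnected */
	FLUSH_COMPLETED,	/* request sent, device answered */
	FLUSH_TIMED_OUT,	/* request sent, device already gone */
};

/* Runs between the check and the send, modelling a surprise removal. */
typedef void (*window_hook)(struct fake_dev *dev);

static void yank_card(struct fake_dev *dev)
{
	/* The card is gone, but software has not noticed yet. */
	dev->physically_present = false;
}

static enum flush_result flush_devtlb(struct fake_dev *dev, window_hook hook)
{
	/* Time of check: analogous to testing the disconnected state. */
	if (dev->disconnected)
		return FLUSH_SKIPPED;

	/* The window: anything can happen to the slot here. */
	if (hook)
		hook(dev);

	/* Time of use: the request only completes if the card is there. */
	return dev->physically_present ? FLUSH_COMPLETED : FLUSH_TIMED_OUT;
}

enum flush_result simulate_flush(bool marked_disconnected, bool yank_in_window)
{
	struct fake_dev dev = {
		.disconnected = marked_disconnected,
		.physically_present = !marked_disconnected,
	};

	return flush_devtlb(&dev, yank_in_window ? yank_card : NULL);
}
```

simulate_flush(true, false) skips the flush, which is the case the patch
handles.  simulate_flush(false, true) passes the check and still ends in
the timeout path, which is the case that remains.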