Re: [PATCH v12 5/5] iommu/vt-d: improve ITE fault handling if target device isn't present

From: Ethan Zhao
Date: Wed Jan 31 2024 - 00:42:51 EST


On 1/30/2024 5:24 PM, Tian, Kevin wrote:
From: Ethan Zhao <haifeng.zhao@xxxxxxxxxxxxxxx>
Sent: Tuesday, January 30, 2024 5:13 PM

On 1/30/2024 4:43 PM, Tian, Kevin wrote:
From: Ethan Zhao <haifeng.zhao@xxxxxxxxxxxxxxx>
Sent: Tuesday, January 30, 2024 4:16 PM

On 1/30/2024 2:22 PM, Tian, Kevin wrote:
Here we need consider two situations.

One is that the device is not bound to a driver or bound to a driver
which doesn't do active work to the device when it's removed. In
that case one may observe the timeout situation only in the removal
path as the stack dump in your patch02 shows.
When iommu_bus_notifier() got called for hotplug removal cases to
flush devTLB (ATS invalidation), driver was already unloaded.
whatever safe removal or surprise removal. so in theory no active
driver working there.

pciehp_ist()
pciehp_disable_slot()
remove_board()
pciehp_unconfigure_device()
pci_stop_and_remove_bus_device()
pci_stop_bus_device()--->here unload driver
pci_remove_bus_device()->here qi_flush_dev_iotlb() got called.
yes, so patch02 can fix this case.

patch02 can fix that case by checking whether the device is present
to skip sending the invalidation requests. So the logic being discussed
here doesn't matter.

The 2nd situation is more tricky. The device might be bound to
a driver which is doing active work to the device with in-fly
ATS invalidation requests. In this case in-fly requests must be aborted
before the driver can be detached from the removed device.
Conceptually
a device is removed from the bus only after its driver is detached.
Some tricky situations:

1. The ATS invalidation request is issued from driver driver, while it is
in handling, device is removed. this momment, the device instance still
exists in the bus list. yes, if searching it by BDF, could get it.
it's searchable between the point where the device is removed and the
point where the driver is unloaded:

CPU0 CPU1
(Driver is active) (pciehp handler)
qi_submit_sync() pciehp_ist()
... ...
loop for completion() { pciehp_unconfigure_device()
... pci_dev_set_disconnected()
if (ITE) { ...
//find pci_dev from sid pci_remove_bus_device()
if (pci_dev_is_connected()) device_del()
break; bus_remove_device()
} device_remove_driver()
If the device was hot plugin or re-scanned, the device has a PCI_DEV_ADDED
flag,
in this case is pci_dev_is_disconnected() true or false?

how is this patch supposed to work with it?

pci_dev_is_disconnected() is true for safe removal, false for surprise removal, but it not called in this patch, is used in patch[2/5], explained in its commit log. This patch use the pci_device_is_present() to check device present or not. if pci_dev_is_disconnected() returns true, then check its presence by pci vendor
configuration reading (a specific protocal in PCIe spec).


if so the driver unloading work isn't defered to the tail of device_del(), it
is unloaded before pci_remove_bus_device()->device_del(), in pci_stop_dev

pci_stop_bus_device()
pci_stop_dev()
{
if (pci_dev_is_added(dev)) {
device_release_driver(&dev->dev);
}
no matter where driver unload is requested, it needs to wait for aborting
in-fly request on CPU0.

yes, the progress of driver unloading has complex sync mechanism in
__device_release_driver() to do that.


So the interval the device is searchable, only applied to those devices
not hot plugged, or never be scanned.

and in the worst case even if pci_dev is not searchable, isn't it already
an indicator that the device is absent then qi_submit_sync() should
just exit upon ITE?

Hmmm, pci_dev is not searchable, but that pci_dev instance is just not in
the bus list or device list, not mean is disconnected or not present that
moment. :)


Thanks,
Ethan