Re: [RESEND PATCH v8 04/11] bus: platform, amba, fsl-mc, PCI: Add device DMA ownership management

From: Zenghui Yu
Date: Mon Jun 26 2023 - 09:03:30 EST


On 2022/4/18 8:49, Lu Baolu wrote:
The devices on platform/amba/fsl-mc/PCI buses could be bound to drivers
with the device DMA managed by kernel drivers or user-space applications.
Unfortunately, multiple devices may be placed in the same IOMMU group
because they cannot be isolated from each other. The DMA on these devices
must be entirely under kernel control or entirely under userspace
control, never a mixture. Otherwise driver integrity cannot be
guaranteed, because the devices could access each other through
peer-to-peer accesses that bypass the IOMMU protection.

This patch checks and sets the default DMA mode during driver binding, and
cleans up during driver unbinding. In the default mode, the device DMA is
managed by the device driver which handles DMA operations through the
kernel DMA APIs (see Documentation/core-api/dma-api.rst).
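
For illustration (not part of the patch): in the default mode a driver
keeps using the kernel DMA API as usual and never sees the ownership
tracking. A minimal sketch, with a hypothetical helper my_map_buffer():

#include <linux/dma-mapping.h>

/*
 * Default-mode DMA: the driver maps through the kernel DMA API; the bus
 * code has already claimed the group's default IOMMU domain on its
 * behalf at driver-binding time.
 */
static int my_map_buffer(struct device *dev, void *buf, size_t len,
			 dma_addr_t *dma)
{
	*dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, *dma))
		return -ENOMEM;
	return 0;
}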

For cases where devices are assigned for userspace control through a
userspace driver framework (i.e. VFIO), the drivers (for example,
vfio_pci/vfio_platform, etc.) may set a new flag (driver_managed_dma) to
skip this default setting, on the assumption that the drivers know what
they are doing with the device DMA.
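
For illustration, an opted-out driver would look roughly like the sketch
below (hypothetical name, id_table and callbacks; the actual VFIO
conversions come in later patches of this series):

/*
 * Setting driver_managed_dma tells the IOMMU layer not to claim the
 * group's default domain at bind time; the driver takes DMA ownership
 * of the group itself instead.
 */
static struct pci_driver my_userspace_driver = {
	.name			= "my-userspace-driver",
	.id_table		= my_ids,	/* hypothetical */
	.probe			= my_probe,	/* hypothetical */
	.remove			= my_remove,	/* hypothetical */
	.driver_managed_dma	= true,
};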

Calling iommu_device_use_default_domain() before {of,acpi}_dma_configure
is currently a problem. As things stand, the IOMMU driver ignores the
initial iommu_probe_device() call when the device is added, since at
that point it has no fwspec yet. In this situation,
{of,acpi}_iommu_configure() retrigger iommu_probe_device() after the
IOMMU driver has seen the firmware data via .of_xlate and learned that
it is actually responsible for the given device. As a result, until
that gets fixed, iommu_device_use_default_domain() goes at the end, and
arch_teardown_dma_ops() is called if it fails.

Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Cc: Stuart Yoder <stuyoder@xxxxxxxxx>
Cc: Laurentiu Tudor <laurentiu.tudor@xxxxxxx>
Signed-off-by: Lu Baolu <baolu.lu@xxxxxxxxxxxxxxx>
Reviewed-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
Reviewed-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
Reviewed-by: Robin Murphy <robin.murphy@xxxxxxx>
Tested-by: Eric Auger <eric.auger@xxxxxxxxxx>

[...]

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 60adf42460ab..b933d2b08d4d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -895,6 +895,13 @@ struct module;
* created once it is bound to the driver.
* @driver: Driver model structure.
* @dynids: List of dynamically added device IDs.
+ * @driver_managed_dma: Device driver doesn't use kernel DMA API for DMA.
+ * For most device drivers, no need to care about this flag
+ * as long as all DMAs are handled through the kernel DMA API.
+ * For some special ones, for example VFIO drivers, they know
+ * how to manage the DMA themselves and set this flag so that
+ * the IOMMU layer will allow them to setup and manage their
+ * own I/O address space.
*/
struct pci_driver {
struct list_head node;
@@ -913,6 +920,7 @@ struct pci_driver {
const struct attribute_group **dev_groups;
struct device_driver driver;
struct pci_dynids dynids;
+ bool driver_managed_dma;
};
static inline struct pci_driver *to_pci_driver(struct device_driver *drv)

[...]

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 4ceeb75fc899..f83f7fbac68f 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -20,6 +20,7 @@
#include <linux/of_device.h>
#include <linux/acpi.h>
#include <linux/dma-map-ops.h>
+#include <linux/iommu.h>
#include "pci.h"
#include "pcie/portdrv.h"
@@ -1601,6 +1602,7 @@ static int pci_bus_num_vf(struct device *dev)
*/
static int pci_dma_configure(struct device *dev)
{
+ struct pci_driver *driver = to_pci_driver(dev->driver);
struct device *bridge;
int ret = 0;
@@ -1616,9 +1618,24 @@ static int pci_dma_configure(struct device *dev)
}
pci_put_host_bridge_device(bridge);
+
+ if (!ret && !driver->driver_managed_dma) {
+ ret = iommu_device_use_default_domain(dev);
+ if (ret)
+ arch_teardown_dma_ops(dev);
+ }
+
return ret;
}
+static void pci_dma_cleanup(struct device *dev)
+{
+ struct pci_driver *driver = to_pci_driver(dev->driver);
+
+ if (!driver->driver_managed_dma)
+ iommu_device_unuse_default_domain(dev);
+}
+
struct bus_type pci_bus_type = {
.name = "pci",
.match = pci_bus_match,
@@ -1632,6 +1649,7 @@ struct bus_type pci_bus_type = {
.pm = PCI_PM_OPS_PTR,
.num_vf = pci_bus_num_vf,
.dma_configure = pci_dma_configure,
+ .dma_cleanup = pci_dma_cleanup,
};
EXPORT_SYMBOL(pci_bus_type);

I (somehow forgot to delete DEBUG_TEST_DRIVER_REMOVE in my .config, and)
failed to start the guest with an assigned PCI device, with something
like:

| qemu-system-aarch64: -device vfio-pci,host=0000:03:00.1,id=hostdev0,bus=pci.8,addr=0x0: vfio 0000:03:00.1: group 45 is not viable
| Please ensure all devices within the iommu_group are bound to their vfio bus driver.

It looks like on device probe, with DEBUG_TEST_DRIVER_REMOVE,
.dma_configure() is executed *twice* via the really_probe()/re_probe
path, while *no* .dma_cleanup() is executed in between. The resulting
dev->iommu_group->owner_cnt is 2, which confuses the later
iommu_group_dma_owner_claimed() call from VFIO on guest startup.
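
If I read drivers/base/dd.c right, the flow is roughly (a sketch, not
verbatim code):

really_probe()
re_probe:
	dev->bus->dma_configure(dev);	/* 1st pass: owner_cnt 0 -> 1 */
	/* ... the driver probes successfully ... */
	if (test_remove) {
		/* device is removed, but .dma_cleanup() never runs */
		goto re_probe;		/* 2nd pass: owner_cnt 1 -> 2 */
	}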

I can work around the problem locally by deleting the DEBUG option or by
performing a .dma_cleanup() during test_remove'ing, but it'd be great if
you could take a look.
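
The latter looks roughly like the following untested sketch against my
reading of really_probe() (not a proper fix):

	if (test_remove) {
		test_remove = false;

		device_remove(dev);
		driver_sysfs_remove(dev);
		/* balance the earlier .dma_configure() before re-probing */
		if (dev->bus && dev->bus->dma_cleanup)
			dev->bus->dma_cleanup(dev);
		device_unbind_cleanup(dev);

		goto re_probe;
	}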

Thanks,
Zenghui