Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

From: David Woodhouse
Date: Sun Nov 08 2020 - 14:37:07 EST


On Sun, 2020-11-08 at 19:47 +0100, Thomas Gleixner wrote:
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> solved because from the guest OS view that's the same as running on bare
> metal. Obviously on bare metal the Vector domain can and must handle
> this.
>
> So this needs some thought.

The problem here is that Intel implemented interrupt remapping in a way
which is anathema to structured, ordered IRQ domains.

When a guest writes an MSI message (addr/data) to the MSI table of a
PCI device which has been assigned to that guest, it *doesn't* properly
inherit the MSI composition from a parent irqdomain which knows about
the (host-side) IOMMU.

What actually happens is the hypervisor *traps* the writes to the
device's MSI table, and translates them *then*. In *precisely* the
fashion which we're trying to avoid for IMS.

Now, you can imagine a world where it wasn't like this, where
Remappable Format MSI messages don't exist, and where we let guests
write native MSI message to the device without trapping — and where the
IOMMU then sees the incoming interrupt and has to map the APIC ID to a
*virtual* CPU for that guest, based on the PCI source-id of the device.

In that world, IMS would work naturally. But that isn't how Intel
designed interrupt remapping. They *designed* to have to trap and
translate as the message is written to the device.

So it does look like we're going to need a hypercall interface to
compose an MSI message on behalf of the guest, for IMS to use. In fact
PCI devices assigned to a guest could use that too, and then we'd only
need to trap-and-remap any attempt to write a Compatibility Format MSI
to the device's MSI table, while letting Remappable Format messages get
written directly.

We'd also need a way for an OS running on bare metal to *know* that
it's on bare metal and can just compose MSI messages for itself. Since
we do expect bare metal to have an IOMMU, perhaps that is just a
feature flag on the IOMMU?

That or Intel needs to fix the IOMMU to do proper virtualisation and
actually translate "Compatibility Format" MSIs for a guest too.

Attachment: smime.p7s
Description: S/MIME cryptographic signature