Re: [PATCH v3 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

From: Jason Gunthorpe
Date: Tue Jun 06 2023 - 20:14:24 EST


On Tue, Jun 06, 2023 at 03:30:57PM -0600, Alex Williamson wrote:
> > To emulate PCI, someone, somewhere, has to fix this mismatch up.
> >
> > So given choices
> > 1) Qemu sees a special NVIDIA thing and fixes it
> > 2) Qemu sees a VFIO_PCI_BAR0_REGION with an odd size and fixes it
> > 3) Kernel lies and makes a power-2 size and it fixes it
> >
> > 2 seems the most forward looking and reusable.
>
> What?! It's not just a matter of fixing it. The vfio-pci uAPI should
> never return a BAR region that's not compatible as a BAR.

Why? But OK, if you don't like then then let's call it
VFIO_PCI_BAR0_REGION_NOT_POW2. Not seeing that it really helps so
much..

> It's incorrectly sized, it does special things with mmap under the
> covers, and it doesn't honor the memory enable bit.

The mmap attributes stuff is not uAPI visible, so it doesn't matter.

Normal vfio-pci devices will SIGBUS on memory disable, this could do
that if it was really important (I don't think it is)

So we are left with.. The size is weird. Someone has to provide the
fixing to fit that into the PCI config space world because we are
emulating a real PCI device.

The fixing is generic, a generic function does not elevate to create a
vendor uAPI IMHO.

> And then QEMU goes on to ignore this peculiarity when setting up all
> the ACPI features, instead relying on the PCI vendor/device ID when
> it could be using a device specific region to initiate that support.

We really should not rely on vendor regions to trigger device specific
VMM behaviors for a variant driver. If we want to do this better than
vendor/device ID we should have a VFIO ioctl report which variant
driver is running the device so userspace can do whatever.

> > I definately don't think this is important enough to stick a vendor
> > label on it.
>
> How high is the bar for a device specific region? This certainly looks
> and smells like one to me.

I would say if the thing that is showing up on the VM side is not PCI
then maybe a vendor label might make sense.

> > Broadly, we are looking toward a way for the kernel VFIO variant
> > driver to provide the majority of the "PCI emulation" and the VMM can
> > be general. It is not nice if every PCI emulation type driver needs
> > unique modifications to the VMM to support it. We are planning more
> > than one of these things already, and there are industry standards
> > afoot that will widly open the door here.
>
> Meanwhile every VMM needs a hook to extend non-compliant BAR sizes,
> assuming the kernel will fixup mappings beyond the region extent,

Yes! It is a basically a new generic VFIO ability to allow this size
adaptation. If you don't like this version of the uAPI then lets tweak
it, but it still needs the same basic operation where the kernel tells
userspace that a certain mmap is to be placed in a certain BAR config
space.

> pretend that none of this is a device bug? It really is a very small
> amount of code in QEMU to setup a MemoryRegion based on a device
> specific region and register it as a PCI BAR. The non-standard size is
> a factor here when mapping to the VM address space, but I'm a bit
> surprised to hear an argument for hacking that in the kernel rather
> than userspace.

Well, I'd rather do it in userspace.

> > The only special bit is emulating the weird Grace FW ACPI stuff.
>
> And a device specific region seems like the ideal jumping off point to
> create that memory-only node for this thing.

It really has nothing to do with the regions, it is something that is
needed if this variant driver is being used at all. The vPCI device
will work without the ACPI, but the Linux drivers won't like it.

> > So lets find a way to give these things appropriate generic names at
> > the ABI level please..
>
> What is the generic feature that "these things" implement?

As far as I can see, non-power-2 size is the thing the VMM needs to
worry about.

And maybe a generic way to detect which variant driver is running.

> There's a lot of vendor specific things going on here. Not only is all
> that "weird Grace FW ACPI stuff" based on this region, but also if we
> are exposing it as a BAR, which BAR index(s) for a given device.

The kernel decides the BAR indexes, not the vmm, because broadly we
want to have the kernel in charge of making the synthetic config
space.

The ACPI is not related to the region. It is just creating many empty
NUMA nodes. They should have no CPUs and no memory. The patch is
trying to make the insertion of the ACPI automatic. Keying it off a
region is not right for the purpose.

> If "the industry" does come out with a spec for "these things",
> couldn't QEMU optionally plug a device specific region into the
> implementation of that spec, potentially along with some commonly
> defined region to make this device appear to honor that new spec?
> Versus with the in-kernel variant driver masquerading the BAR, we're
> stuck with what the kernel implements. Would future hardware
> implementing to this new spec require a variant driver or would we
> extend vfio-pci to support them with extended regions and expect the
> VMM to compose them appropriately?

Look at Intel's SIOV document for some idea, I expect devices like
that will be a VFIO driver (not a variant PCI driver) that largely
exposes the vfio-pci uAPI with the purpose of creating a vPCI device
in the VM.

There is no real PCI function under this so all the config space and
so on will be synthetic. It is convenient if the kernel driver does
this so that it works on all the VMMs generically.

Non-power-2 BAR is desirable in this world because address space is
precious at high scale and power2 scaling gets wasteful.

Further, I would expect there will be a generic CXL driver and generic
CXL related ACPI someday.

This device is sort of in the middle where it does have a real PCI
function but it is also synthesizing some config space. We have
another VFIO driver in progress that is also doing some modification
of the config space..

Jason