Re: [PATCH v3 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

From: Jason Gunthorpe
Date: Tue Jun 13 2023 - 10:35:54 EST


On Wed, Jun 07, 2023 at 12:23:03PM -0600, Alex Williamson wrote:

> > The fixing is generic, a generic function does not elevate to create a
> > vendor uAPI IMHO.
>
> Who else is trying to expose a non-power-of-2 region as a BAR right
> now?

I see a few needs in other places internally that are not public yet,
especially in the SIOV world I alluded to below.

> We have neither a specification nor a complimentary implementation
> from which to derive generic support.

"specification"? It is literally - the size of the bar is not a power
of two, so make the resulting hole when mapping that to a vPCI discard
writes and read 0s.

This isn't a PCI specification, it is a contract between the kernel
side and the VMM side within VFIO about how to handle VFIO devices
where the kernel driver wants to have a padding.

It makes sense that we shouldn't just do this blidly for the existing
indexes without some negotation, but that isn't a PCI "compliance"
problem, that is a incomplete VFIO uAPI design in this patch.

> GPUs seem to manage to have non-power-of-2 size VRAM while still
> providing BARs that are a power-of-2, ex. 6GB VRAM, 8GB BAR.

Generally excess BAR is modeled in HW as discard writes return 0 (or
maybe return -1). HW can do this easially. SW is more tricky

> Why shouldn't the variant driver here extend the BAR region to a
> power-of-2 and then we can decide the error handling should accesses
> exceed the implemented range?

This sounds doable, I don't know if Ankit had a problem with using the
sparse mmap feature to do this. We'd need to have the padding be
non-mmapable space and then the kernel driver would do the discard/0
with the read/write calls.

If this works out I'm happy enough if we go this way too. I suppose it
allows better compatibility with all the VMMs.

> > I would say if the thing that is showing up on the VM side is not PCI
> > then maybe a vendor label might make sense.
>
> Well, how do you suppose a device with a non-power-of-2 BAR is PCI
> compliant? You're asking the VMM to assume what the driver meant by
> providing that non-standard BAR, which sounds vendor specific to me.

It shows up as a PCI compliant device in the VM, with a compliant
power of two size.

> Is your primary complaint here that you don't want
> that region to be labeled VFIO_PCI_NVGPU_BAR1?

Yes. This modified VFIO uAPI contract is general enough it should not
be forced as unique to this device.

> We could certainly define VFIO_PCI_VENDOR_BAR0..5 where QEMU knows
> that it's supposed to relax expectations and mangle the region into
> a compliant BAR, but now you're adding complexity that may never be
> used elsewhere.

I wouldn't use a _VENDOR_ name for this since it is generic.

I would suggest using a FEATURE:

/*
* When SET to 1 it indicates that the userspace understands non-power of two
* region sizes on VFIO_PCI_BARx_REGION_INDEX. If the kernel driver requests a
* non-power of two size then if userspace needs to round the size back up to a
* power of two, eg to create a vPCI device, it should return 0 for reads and
* discard writes for the padding that was added.
*
* If this flag is not set, and the VFIO driver cannot work without non-power of
* two BARs then mmap of those BAR indexes will fail. (FIXME: maybe
* this needs more thinking)
*/
#define VFIO_DEVICE_FEATURE_PCI_PAD_BARS 10

As adding new index types for every new functionality will become
combinatoral, and adding discovery of which vPCI BAR index the new
indexes should map too is also complicated..

Though sparse mmap is probably better.

> > It really has nothing to do with the regions, it is something that is
> > needed if this variant driver is being used at all. The vPCI device
> > will work without the ACPI, but the Linux drivers won't like it.
>
> OTOH if the ACPI work is based on device specific regions, the list of
> device IDs in QEMU goes away and support for a new device requires no
> VMM changes.

Using the device ID seems like the better approach as we don't know
what future devices using this varient driver are going to need for
ACPI modeling.

It also seems I was too optimistic to think a simple variant driver ID
would be sufficient. IMHO you are closer below, it depends on what
bare metal FW qemu is trying to emulate, which I suppose is
technically a combination of the machine type and the installed vPCI
devices..

> > The ACPI is not related to the region. It is just creating many empty
> > NUMA nodes. They should have no CPUs and no memory. The patch is
> > trying to make the insertion of the ACPI automatic. Keying it off a
> > region is not right for the purpose.
>
> Why aren't the different NUMA nodes a machine type option? If we start
> having each device mangle the machine in incompatible ways it seems
> like we're going to get conflicts not only with other devices, but also
> user specified NUMA configurations.

You may be right, I think this patch is trying to make things
automatic for user, but a dedicated machine type might make more
sense.

> I'm struggling with whether I can
> set some bits in the root port devcap2 register[1] based on device
> capabilities and here this is fundamentally manipulating the VM
> topology.
>
> [1]https://lore.kernel.org/all/20230526231558.1660396-1-alex.williamson@xxxxxxxxxx/

That does seem a bit dicey - alot of this stuff, especially ACPI, is
reasonably assumed to be set at boot time by an OS. Changing it
dynamically becomes exciting..

Thanks,
Jason