Re: [PATCH v3 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

From: Alex Williamson
Date: Tue Jun 06 2023 - 13:06:06 EST


On Tue, 6 Jun 2023 12:32:13 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Tue, Jun 06, 2023 at 08:32:38AM -0600, Alex Williamson wrote:
> > On Mon, 5 Jun 2023 19:53:20 -0700
> > <ankita@xxxxxxxxxx> wrote:
> >
> > > From: Ankit Agrawal <ankita@xxxxxxxxxx>
> > >
> > > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > > for the on-chip GPU that is the logical OS representation of the
> > > internal proprietary cache coherent interconnect.
> > >
> > > This representation has a number of limitations compared to a real PCI
> > > device; in particular, it does not model the coherent GPU memory
> > > aperture as a PCI config space BAR, and PCI doesn't know anything
> > > about cacheable memory types.
> > >
> > > Provide a VFIO PCI variant driver that adapts the unique PCI
> > > representation into a more standard PCI representation facing
> > > userspace. The GPU memory aperture is obtained from ACPI using
> > > device_property_read_u64(), according to the FW specification,
> > > and exported to userspace as a separate VFIO_REGION. Since the device
> > > implements only one 64-bit BAR (BAR0), the GPU memory aperture is mapped
> > > to the next available PCI BAR (BAR2). QEMU will then naturally generate a
> > > PCI device in the VM with two 64-bit BARs (where the cacheable aperture
> > > is reported in BAR2).
> > >
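For concreteness, the probe-time lookup described above amounts to
something like the minimal sketch below; the property names are my
placeholders, not taken from the posted patch:

#include <linux/pci.h>
#include <linux/property.h>

/* Sketch only; the "nvidia,gpu-mem-*" property names are assumed */
static int nvgpu_get_mem_aperture(struct pci_dev *pdev,
				  u64 *memphys, u64 *memlength)
{
	int ret;

	/* Base physical address of the coherent GPU memory aperture */
	ret = device_property_read_u64(&pdev->dev,
				       "nvidia,gpu-mem-base-pa", memphys);
	if (ret)
		return ret;

	/* Aperture size; not necessarily power-of-2 in size */
	return device_property_read_u64(&pdev->dev,
					"nvidia,gpu-mem-size", memlength);
}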
> > > Since this memory region is actually cache coherent with the CPU, the
> > > VFIO variant driver will mmap it into the VMA using a cacheable
> > > mapping. The mapping is done using remap_pfn_range().
> > >
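In other words, unlike an ordinary vfio-pci BAR mmap, vma->vm_page_prot
is not run through pgprot_noncached(); the VMA's default cacheable
attributes are kept. Roughly (function name and simplifications mine):

#include <linux/minmax.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/* Sketch only; assumes a single mapping starting at offset 0 */
static int nvgpu_mmap_coherent(struct vm_area_struct *vma,
			       u64 memphys, u64 memlength)
{
	unsigned long size = min_t(unsigned long,
				   vma->vm_end - vma->vm_start, memlength);

	/*
	 * vma->vm_page_prot is deliberately left cacheable, since this
	 * memory is cache coherent with the CPU over the interconnect.
	 */
	return remap_pfn_range(vma, vma->vm_start, PHYS_PFN(memphys),
			       size, vma->vm_page_prot);
}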
> > > PCI BARs are power-of-2 in size, but the actual memory on the device
> > > may not be. The physical addresses from the last device PFN up to the
> > > next power-of-2 aligned PA are thus mapped to a dummy PFN through a
> > > vm_operations fault handler.
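A fault handler for that padding could then be as simple as the sketch
below (the names and the dummy page allocation are mine, assumed from
the description above):

#include <linux/mm.h>

static struct page *nvgpu_dummy_page;	/* allocated once at probe */

static vm_fault_t nvgpu_vma_fault(struct vm_fault *vmf)
{
	/*
	 * remap_pfn_range() covered only the real device memory, so
	 * faults can only land in the power-of-2 padding past the last
	 * device PFN; back every such page with the same dummy PFN.
	 */
	return vmf_insert_pfn(vmf->vma, vmf->address,
			      page_to_pfn(nvgpu_dummy_page));
}

static const struct vm_operations_struct nvgpu_vm_ops = {
	.fault = nvgpu_vma_fault,
};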
> >
> > As noted in the QEMU series, this all suggests to me that we should
> > simply expose a device specific region for this coherent memory which
> > QEMU can then choose to expose to the VM as a BAR, or not.
>
> It isn't exposed as a BAR on bare metal due to a HW limitation. When
> we look toward VFIO CXL devices I would expect them to have proper
> BARs and not this ACPI hack.
>
> So the approach is to compartmentalize the hack to the bare metal
> kernel driver and let the ABI and qemu parts be closer to what CXL
> will eventually need.
>
> > It's clearly not a BAR on bare metal, so if we need to go to all the
> > trouble to create ACPI tables to further define the coherent memory
> > space,
>
> The ACPI tables shouldn't relate to the "BAR", they are needed to
> overcome the NUMA problems in the kernel in the same way real device
> FW does.
>
> > what's the benefit of pretending that it's a PCI BAR? ie. Why should a
> > VM view this as a BAR rather than follow the same semantics as bare
> > metal?
>
> Primarily it is a heck of a lot simpler in qemu and better aligned
> with where things are going.

It actually seems more complicated this way. We're masquerading this
region as a BAR, but then QEMU needs to know based on device IDs that
it's really not a BAR: it has special size properties, mapping
attributes, error handling, etc. Maybe the fact that it's not affected
by the PCI config space memory enable bit should have been the hint
that a BAR region is not the right way for vfio to compose the device.

It's really beside the point whether you want QEMU to expose the memory
region to the VM as a BAR, but the more I see how this works, the more
it makes sense to me that this should be a device specific region that
is the trigger for QEMU to set up these special properties. It is
trivial for QEMU to expose a region as a BAR, and it can then manage
the size issues for mapping, keeping things like an overflow page out
of the kernel.
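Consuming such a device specific region from userspace is also cheap;
QEMU already knows how to walk region info capabilities to find vendor
regions. A rough sketch of that lookup (only the vendor-type bit is
checked here; no subtype has been allocated for this device):

#include <linux/vfio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Sketch only; num_regions comes from VFIO_DEVICE_GET_INFO */
static int find_vendor_region(int device_fd, unsigned int num_regions)
{
	unsigned int i;

	for (i = VFIO_PCI_NUM_REGIONS; i < num_regions; i++) {
		/* 4K is comfortably larger than any capability chain */
		struct vfio_region_info *info = calloc(1, 4096);
		struct vfio_info_cap_header *hdr;

		if (!info)
			return -1;
		info->argsz = 4096;
		info->index = i;

		if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info) ||
		    !(info->flags & VFIO_REGION_INFO_FLAG_CAPS))
			goto next;

		for (hdr = (void *)((char *)info + info->cap_offset); ;
		     hdr = (void *)((char *)info + hdr->next)) {
			struct vfio_region_info_cap_type *cap = (void *)hdr;

			if (hdr->id == VFIO_REGION_INFO_CAP_TYPE &&
			    (cap->type & VFIO_REGION_TYPE_PCI_VENDOR_TYPE)) {
				free(info);
				return (int)i;	/* candidate region index */
			}
			if (!hdr->next)
				break;
		}
next:
		free(info);
	}
	return -1;
}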

> > We can then have a discussion whether this even needs to be a variant
> > driver versus a vfio-pci quirk if this device specific region is the
> > only feature provided (ie. is migration in the future for this
> > driver?).
>
> There is a lot more here; go back to the original v1 posting to see it
> all. This is way too much to be just a quirk.

I'm not privy to a v1; the earliest I see is this (v3):

https://lore.kernel.org/all/20230405180134.16932-1-ankita@xxxxxxxxxx/

That outlines that we have a proprietary interconnect exposing cache
coherent memory, which requires special mapping attributes vs a
standard PCI BAR and participates in ECC, all of which seems like it
would be easier to set up in QEMU if the vfio-pci representation of
the device didn't masquerade this region as a standard BAR. In fact it
also reminds me of NVLink2 coherent RAM on POWER machines, which was
similarly handled as a device specific region. Thanks,

Alex