Re: [PATCH v3 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

From: Jason Gunthorpe
Date: Tue Jun 06 2023 - 11:32:32 EST


On Tue, Jun 06, 2023 at 08:32:38AM -0600, Alex Williamson wrote:
> On Mon, 5 Jun 2023 19:53:20 -0700
> <ankita@xxxxxxxxxx> wrote:
>
> > From: Ankit Agrawal <ankita@xxxxxxxxxx>
> >
> > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > for the on-chip GPU that is the logical OS representation of the
> > internal proprietary cache coherent interconnect.
> >
> > This representation has a number of limitations compared to a real PCI
> > device, in particular, it does not model the coherent GPU memory
> > aperture as a PCI config space BAR, and PCI doesn't know anything
> > about cacheable memory types.
> >
> > Provide a VFIO PCI variant driver that adapts the unique PCI
> > representation into a more standard PCI representation facing
> > userspace. The GPU memory aperture is obtained from ACPI using
> > device_property_read_u64(), according to the FW specification,
> > and exported to userspace as a separate VFIO_REGION. Since the device
> > implements only one 64-bit BAR (BAR0), the GPU memory aperture is mapped
> > to the next available PCI BAR (BAR2). Qemu will then naturally generate a
> > PCI device in the VM with two 64-bit BARs (where the cacheable aperture
> > reported in BAR2).
> >
> > Since this memory region is actually cache coherent with the CPU, the
> > VFIO variant driver will mmap it into VMA using a cacheable mapping. The
> > mapping is done using remap_pfn_range().
> >
> > PCI BAR are aligned to the power-of-2, but the actual memory on the
> > device may not. The physical address from the last device PFN up to the
> > next power-of-2 aligned PA thus is mapped to a dummy PFN through
> > vm_operations fault.
>
> As noted in the QEMU series, this all suggests to me that we should
> simply expose a device specific region for this coherent memory which
> QEMU can then choose to expose to the VM as a BAR, or not.

It doesn't expose as a BAR on bare metal due to a HW limitation. When
we look toward VFIO CXL devices I would expect them to have proper
BARs and not this ACPI hack.

So the approach is to compartmentalize the hack to the bare metal
kernel driver and let the ABI and qemu parts be closer to what CXL
will eventually need.

> It's clearly not a BAR on bare metal, so if we need to go to all the
> trouble to create ACPI tables to further define the coherent memory
> space,

The ACPI tables shouldn't relate to the "BAR", they are needed to
overcome the NUMA problems in the kernel in the same way real device
FW does.

> what's the benefit of pretending that it's a PCI BAR? ie. Why should a
> VM view this as a BAR rather than follow the same semantics as bare
> metal?

Primarily it is a heck of a lot simpler in qemu and better aligned
with where things are going.

> We can then have a discussion whether this even needs to be a variant
> driver versus a vfio-pci quirk if this device specific region is the
> only feature provided (ie. is migration in the future for this
> driver?).

There is alot more here, go back to the original v1 posting to see it
all. This is way too much to be just a quirk.

Jason