Re: [PATCH v3 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

From: Alex Williamson
Date: Tue Jun 06 2023 - 14:14:42 EST


On Tue, 6 Jun 2023 14:16:42 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Tue, Jun 06, 2023 at 11:05:10AM -0600, Alex Williamson wrote:
>
> > It actually seems more complicated this way. We're masquerading this
> > region as a BAR, but then QEMU needs to know based on device IDs that
> > it's really not a BAR, it has special size properties, mapping
> > attributes, error handling, etc.
>
> This seems like something has gone wrong then. ie the SIGUBS error
> handling stuff should be totally generic in the qemu side. Mapping
> attributes are set by the kernel, qemu shouldn't know, doesn't need to
> know.

You asked me to look at the v1 posting to see why there's so much more
going on here than a quirk. That's what I read from the first public
posting, a coherent memory region masqueraded as a BAR which requires
different memory mapping and participates in ECC. I agree that the
actual mapping is done by the kernel, but it doesn't really make a
difference if that's a vfio-pci variant driver providing a different
mmap callback for a BAR region or a device specific region handler.

> The size issue is going to a be a problem in future anyhow, I expect
> some new standards coming to support non-power-two sizes and they will
> want to map to PCI devices in VMs still.

Ok, but a PCI BAR has specific constraints and a non-power-of-2 BAR is
not software compatible with those constraints. That's obviously not
to say that a new capability couldn't expose arbitrary resources sizes
on a PCI-like device though. I don't see how a non-power-of-2 BAR at
this stage helps or fits within any spec, which is exactly what's
being proposed through this BAR masquerade.

> It seems OK to me if qemu can do this generically for any "BAR"
> region, at least creating an entire "nvidia only" code path just for
> non power 2 BAR sizing seems like a bad ABI choice.

Have you looked at Ankit's QEMU series? It's entirely NVIDIA-only code
paths. Also nothing here precludes that shared code in QEMU might
expose some known arbitrary sized regions as a BAR, or whatever spec
defined thing allows that in the future. It would only be a slight
modification in the QEMU code to key on the presence of a device
specific region rather than PCI vendor and device IDs, to then register
that region as a PCI BAR and proceed with all this NVIDIA specific
PXM/SRAT setup. IMO it makes a lot more sense to create memory-only
NUMA nodes based on a device specific region than it does a PCI BAR.

> > I'm not privy to a v1, the earliest I see is this (v3):
> >
> > https://lore.kernel.org/all/20230405180134.16932-1-ankita@xxxxxxxxxx/
> >
> > That outlines that we have a proprietary interconnect exposing cache
> > coherent memory which requires use of special mapping attributes vs a
> > standard PCI BAR and participates in ECC. All of which seems like it
> > would be easier to setup in QEMU if the vfio-pci representation of the
> > device didn't masquerade this regions as a standard BAR. In fact it
> > also reminds me of NVlink2 coherent RAM on POWER machines that was
> > similarly handled as device specific regions.
>
> It wasn't so good on POWER and if some of that stuff has been done
> more generally we would have been further ahead here..

Specifics? Nothing here explained why masquerading the coherent memory
as a BAR in the vfio-pci ABI is anything more than a hack that QEMU
could assemble on its own with a device specific region. Thanks,

Alex