Re: [PATCH v3 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

From: Alex Williamson
Date: Tue Jun 06 2023 - 17:32:02 EST


On Tue, 6 Jun 2023 16:05:25 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Tue, Jun 06, 2023 at 12:13:48PM -0600, Alex Williamson wrote:
> > On Tue, 6 Jun 2023 14:16:42 -0300
> > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >
> > > On Tue, Jun 06, 2023 at 11:05:10AM -0600, Alex Williamson wrote:
> > >
> > > > It actually seems more complicated this way. We're masquerading this
> > > > region as a BAR, but then QEMU needs to know based on device IDs that
> > > > it's really not a BAR, it has special size properties, mapping
> > > > attributes, error handling, etc.
> > >
> > > This seems like something has gone wrong then. ie the SIGBUS error
> > > handling stuff should be totally generic on the qemu side. Mapping
> > > attributes are set by the kernel, qemu shouldn't know and doesn't
> > > need to know.
> >
> > You asked me to look at the v1 posting to see why there's so much more
> > going on here than a quirk. That's what I read from the first public
> > posting: a coherent memory region masqueraded as a BAR, which requires
> > different memory mapping and participates in ECC. I agree that the
> > actual mapping is done by the kernel, but it doesn't really make a
> > difference if that's a vfio-pci variant driver providing a different
> > mmap callback for a BAR region or a device specific region handler.
>
> The ECC stuff is generic too.
>
> Even the non-power-2 size thing is *generic*, even if it isn't
> following the PCI SIG.

Generic yes, but exposing a non-power-of-2 size region for a BAR is...
well, wrong.
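
For reference, the reason it can never be compatible: software sizes a
BAR by writing all 1s and reading back a mask, so the only sizes the
register can encode are powers of 2. A quick sketch (the config space
accessors are hypothetical placeholders for the usual
pci_read/write_config_dword() style helpers):

#include <stdint.h>

/* Hypothetical config space accessors, for illustration only */
extern uint32_t cfg_read32(int offset);
extern void cfg_write32(int offset, uint32_t val);

static uint64_t bar_probe_size(int bar_off)
{
	uint32_t orig = cfg_read32(bar_off);
	uint32_t mask;

	cfg_write32(bar_off, 0xffffffff);
	mask = cfg_read32(bar_off) & ~0xfU;	/* drop the type bits */
	cfg_write32(bar_off, orig);		/* restore */

	/*
	 * The device can only hard-wire low address bits to zero, so
	 * what we recover is the lowest writable address bit: by
	 * construction a power of 2. There is simply no encoding for,
	 * say, a 0x1_2000_0000 byte region.
	 */
	return mask ? (uint64_t)~mask + 1 : 0;
}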

> > > The size issue is going to be a problem in the future anyhow, I expect
> > > some new standards coming to support non-power-of-two sizes and they
> > > will still want to map to PCI devices in VMs.
> >
> > Ok, but a PCI BAR has specific constraints and a non-power-of-2 BAR is
> > not software compatible with those constraints. That's obviously not
> > to say that a new capability couldn't expose arbitrary resource sizes
> > on a PCI-like device though. I don't see how a non-power-of-2 BAR at
> > this stage helps or fits within any spec, which is exactly what's
> > being proposed through this BAR masquerade.
>
> To emulate PCI, someone, somewhere, has to fix this mismatch up.
>
> So given choices
> 1) Qemu sees a special NVIDIA thing and fixes it
> 2) Qemu sees a VFIO_PCI_BAR0_REGION with an odd size and fixes it
> 3) Kernel lies and makes a power-of-2 size and it fixes it
>
> 2 seems the most forward-looking and reusable.

What?! It's not just a matter of fixing it. The vfio-pci uAPI should
never return a BAR region that's not compatible as a BAR. It's
incorrectly sized, it does special things with mmap under the covers,
and it doesn't honor the memory enable bit. And then QEMU goes on to
ignore this peculiarity when setting up all the ACPI features, instead
relying on the PCI vendor/device ID when it could be using a device
specific region to initiate that support.
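
For illustration, the size in question is exactly what userspace reads
back through the existing uAPI; nothing NVIDIA specific is needed to
see it (minimal sketch against the stock vfio uAPI headers, error
handling abbreviated):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Query BAR0's region info on an open vfio device fd; this is where a
 * variant driver reporting a non-power-of-2 "BAR" becomes visible.
 */
static int check_bar0(int device_fd)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return -1;

	/* A compliant BAR size must be a power of 2 */
	if (info.size & (info.size - 1))
		fprintf(stderr, "BAR0 size 0x%llx is not a power of 2\n",
			(unsigned long long)info.size);

	return 0;
}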

> I definitely don't think this is important enough to stick a vendor
> label on it.

How high is the bar for a device specific region? This certainly looks
and smells like one to me.
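
To be concrete, the existing region info capability chain already
supports exactly this, the same way the IGD OpRegion is exposed today.
A rough sketch of how userspace would match such a region (the subtype
constant is invented here for illustration; a real one would be added
to the uAPI):

#include <stdbool.h>
#include <linux/vfio.h>

#define PCI_VENDOR_ID_NVIDIA 0x10de
/* Hypothetical subtype, not in the current uAPI */
#define VFIO_REGION_SUBTYPE_NVIDIA_COH_MEM 1

/* Walk the capability chain of a vfio_region_info that was fetched
 * with argsz large enough to include its capabilities.
 */
static bool is_nvidia_coh_mem(struct vfio_region_info *info)
{
	__u32 off;

	if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS))
		return false;

	for (off = info->cap_offset; off; ) {
		struct vfio_info_cap_header *hdr =
			(void *)((char *)info + off);

		if (hdr->id == VFIO_REGION_INFO_CAP_TYPE) {
			struct vfio_region_info_cap_type *cap = (void *)hdr;

			return cap->type == (VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
					     PCI_VENDOR_ID_NVIDIA) &&
			       cap->subtype == VFIO_REGION_SUBTYPE_NVIDIA_COH_MEM;
		}
		off = hdr->next;
	}

	return false;
}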

> Broadly, we are looking toward a way for the kernel VFIO variant
> driver to provide the majority of the "PCI emulation" so the VMM can
> stay general. It is not nice if every PCI-emulation-type driver needs
> unique modifications to the VMM to support it. We are planning more
> than one of these things already, and there are industry standards
> afoot that will open the door here widely.

Meanwhile every VMM needs a hook to extend non-compliant BAR sizes,
assuming the kernel will fix up mappings beyond the region extent, and
pretend that none of this is a device bug? It really is a very small
amount of code in QEMU to set up a MemoryRegion based on a device
specific region and register it as a PCI BAR. The non-standard size is
a factor here when mapping into the VM address space, but I'm a bit
surprised to hear an argument for hacking around that in the kernel
rather than in userspace.
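
As a sketch of what I mean, assuming the region has already been
queried and mmap'd (names illustrative, error handling omitted):

/*
 * Wrap the mmap'd device memory in a power-of-2 container so the
 * guest sees a spec-compliant BAR, then register it like any other.
 */
static void register_coh_mem_bar(VFIOPCIDevice *vdev, int nr,
                                 void *ptr, uint64_t size)
{
    MemoryRegion *container = g_new0(MemoryRegion, 1);
    MemoryRegion *mem = g_new0(MemoryRegion, 1);

    /* Container rounds the odd size up to the power of 2 a BAR needs */
    memory_region_init(container, OBJECT(vdev), "nv-coh-mem-bar",
                       pow2ceil(size));
    memory_region_init_ram_device_ptr(mem, OBJECT(vdev), "nv-coh-mem",
                                      size, ptr);
    memory_region_add_subregion(container, 0, mem);

    pci_register_bar(&vdev->pdev, nr,
                     PCI_BASE_ADDRESS_SPACE_MEMORY |
                     PCI_BASE_ADDRESS_MEM_TYPE_64 |
                     PCI_BASE_ADDRESS_MEM_PREFETCH, container);
}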

> > > It seems OK to me if qemu can do this generically for any "BAR"
> > > region, at least creating an entire "nvidia only" code path just for
> > > non-power-of-2 BAR sizing seems like a bad ABI choice.
> >
> > Have you looked at Ankit's QEMU series?
>
> Not well, I hadn't seen any versions of it until it was posted
>
> > It's entirely NVIDIA-only code paths. Also, nothing here precludes
> > shared code in QEMU exposing some known arbitrary-sized regions as
> > a BAR, or whatever spec-defined thing allows that in the future.
>
> It should not be like that. From a kernel perspective this is
> basically all generic VMM code that can apply to any VFIO driver. The
> kernel side was meant to be a general-purpose API for the VMM.
>
> The only special bit is emulating the weird Grace FW ACPI stuff.

And a device specific region seems like the ideal jumping off point to
create that memory-only node for this thing.

> > > > coherent memory which requires use of special mapping attributes vs a
> > > > standard PCI BAR and participates in ECC. All of which seems like it
> > > > would be easier to set up in QEMU if the vfio-pci representation of
> > > > the device didn't masquerade this region as a standard BAR. In fact
> > > > it also reminds me of NVLink2 coherent RAM on POWER machines that was
> > > > similarly handled as device specific regions.
> > >
> > > It wasn't so good on POWER and if some of that stuff had been done
> > > more generally we would have been further ahead here...
> >
> > Specifics? Nothing here explained why masquerading the coherent memory
> > as a BAR in the vfio-pci ABI is anything more than a hack that QEMU
> > could assemble on its own with a device specific region. Thanks,
>
> Well, we deleted all the POWER stuff and got nothing general out of
> it. Isn't that enough to say it was a bad idea? Here we are again with
> the same basic CXL-ish hardware design, and the answer should not be
> to keep making more vendor ABI.
>
> I think this is the direction things are trending in the
> industry. There is nothing particularly special, and certainly nothing
> *nvidia specific* about these things.
>
> So let's find a way to give these things appropriate generic names at
> the ABI level please.

What is the generic feature that "these things" implement?

There are a lot of vendor-specific things going on here. Not only is
all that "weird Grace FW ACPI stuff" based on this region, but so is
the question of which BAR index(es) to expose it at for a given device.

If "the industry" does come out with a spec for "these things",
couldn't QEMU optionally plug a device specific region into the
implementation of that spec, potentially along with some commonly
defined region to make this device appear to honor that new spec?
Whereas with the in-kernel variant driver masquerading the BAR, we're
stuck with what the kernel implements. Would future hardware
implementing this new spec require a variant driver, or would we
extend vfio-pci to support such devices with extended regions and
expect the VMM to compose them appropriately? Thanks,

Alex