Re: [PATCH v3 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

From: Jason Gunthorpe
Date: Tue Jun 06 2023 - 15:09:00 EST


On Tue, Jun 06, 2023 at 12:13:48PM -0600, Alex Williamson wrote:
> On Tue, 6 Jun 2023 14:16:42 -0300
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> > On Tue, Jun 06, 2023 at 11:05:10AM -0600, Alex Williamson wrote:
> >
> > > It actually seems more complicated this way. We're masquerading this
> > > region as a BAR, but then QEMU needs to know based on device IDs that
> > > it's really not a BAR, it has special size properties, mapping
> > > attributes, error handling, etc.
> >
> > This seems like something has gone wrong then. ie the SIGUBS error
> > handling stuff should be totally generic in the qemu side. Mapping
> > attributes are set by the kernel, qemu shouldn't know, doesn't need to
> > know.
>
> You asked me to look at the v1 posting to see why there's so much more
> going on here than a quirk. That's what I read from the first public
> posting, a coherent memory region masqueraded as a BAR which requires
> different memory mapping and participates in ECC. I agree that the
> actual mapping is done by the kernel, but it doesn't really make a
> difference if that's a vfio-pci variant driver providing a different
> mmap callback for a BAR region or a device specific region handler.

The ECC stuff is generic too.

Even the non-power-2 size thing is *generic*, even if isn't following
PCI SIG.

> > The size issue is going to a be a problem in future anyhow, I expect
> > some new standards coming to support non-power-two sizes and they will
> > want to map to PCI devices in VMs still.
>
> Ok, but a PCI BAR has specific constraints and a non-power-of-2 BAR is
> not software compatible with those constraints. That's obviously not
> to say that a new capability couldn't expose arbitrary resources sizes
> on a PCI-like device though. I don't see how a non-power-of-2 BAR at
> this stage helps or fits within any spec, which is exactly what's
> being proposed through this BAR masquerade.

To emulate PCI, someone, somewhere, has to fix this mismatch up.

So given choices
1) Qemu sees a special NVIDIA thing and fixes it
2) Qemu sees a VFIO_PCI_BAR0_REGION with an odd size and fixes it
3) Kernel lies and makes a power-2 size and it fixes it

2 seems the most forward looking and reusable.

I definately don't think this is important enough to stick a vendor
label on it.

Broadly, we are looking toward a way for the kernel VFIO variant
driver to provide the majority of the "PCI emulation" and the VMM can
be general. It is not nice if every PCI emulation type driver needs
unique modifications to the VMM to support it. We are planning more
than one of these things already, and there are industry standards
afoot that will widly open the door here.

> > It seems OK to me if qemu can do this generically for any "BAR"
> > region, at least creating an entire "nvidia only" code path just for
> > non power 2 BAR sizing seems like a bad ABI choice.
>
> Have you looked at Ankit's QEMU series?

Not well, I haven't seen any versions of it till it was posted

> It's entirely NVIDIA-only code paths. Also nothing here precludes
> that shared code in QEMU might expose some known arbitrary sized
> regions as a BAR, or whatever spec defined thing allows that in the
> future.

It should not be like that. From a kernel perspective this is
basically all generic VMM code that can apply to any VFIO driver. The
kernel side was ment to be a general purpose API for the VMM.

The only special bit is emulating the weird Grace FW ACPI stuff.

> > > coherent memory which requires use of special mapping attributes vs a
> > > standard PCI BAR and participates in ECC. All of which seems like it
> > > would be easier to setup in QEMU if the vfio-pci representation of the
> > > device didn't masquerade this regions as a standard BAR. In fact it
> > > also reminds me of NVlink2 coherent RAM on POWER machines that was
> > > similarly handled as device specific regions.
> >
> > It wasn't so good on POWER and if some of that stuff has been done
> > more generally we would have been further ahead here..
>
> Specifics? Nothing here explained why masquerading the coherent memory
> as a BAR in the vfio-pci ABI is anything more than a hack that QEMU
> could assemble on its own with a device specific region. Thanks,

Well, we deleted all the POWER stuff and got nothing general out of
it. Isn't that enough to say it was a bad idea? Here we are again with
the same basic CXLish hardware design and the answer should not be
keep making more vendor ABI.

I think this is the direction things are trending in the
industry. There is nothing particuarlly special, and certainly nothing
*nvidia specific* about these things.

So lets find a way to give these things appropriate generic names at
the ABI level please..

Jason