Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

From: Jason Gunthorpe
Date: Wed Jan 03 2024 - 14:34:16 EST


On Wed, Jan 03, 2024 at 11:00:16AM -0700, Alex Williamson wrote:
> On Wed, 3 Jan 2024 12:57:27 -0400
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> > On Tue, Jan 02, 2024 at 09:10:01AM -0700, Alex Williamson wrote:
> >
> > > Yes, it's possible to add support that these ranges honor the memory
> > > enable bit, but it's not trivial and unfortunately even vfio-pci isn't
> > > a great example of this.
> >
> > We talked about this already, the HW architects here confirm there is
> > no issue with reset and memory enable. You will get all 1's on read
> > and NOP on write. It doesn't need to implement VMA zap.
>
> We talked about reset, I don't recall that we discussed that coherent
> and uncached memory ranges masquerading as PCI BARs here honor the
> memory enable bit in the command register.

Why does it need to do anything special? If the VM reads/writes from
memory while the memory enable bit is disabled it gets undefined
behavior. The system doesn't crash and the device does something
reasonable.

> > > around device reset or relative to the PCI command register. The
> > > variant driver becomes a trivial implementation that masks BARs 2 & 4
> > > and exposes the ACPI range as a device specific region with only mmap
> > > support. QEMU can then map the device specific region into VM memory
> > > and create an equivalent ACPI table for the guest.
> >
> > Well, no, probably not. There is an NVIDIA specification for how the
> > vPCI function should be setup within the VM and it uses the BAR
> > method, not the ACPI.
>
> Is this specification available? It's a shame we've gotten this far
> without a reference to it.

No, at this point it exists only as an internal short-form document.

> > There are a lot of VMMs and OSs this needs to support so it must all
> > be consistent. For better or worse the decision was taken for the vPCI
> > spec to use BAR not ACPI, in part due to feedback from the broader VMM
> > ecosystem, and informed by future product plans.
> >
> > So, if vfio does special regions then qemu and everyone else has to
> > fix it to meet the spec.
>
> Great, this is the sort of justification and transparency that had not
> been previously provided. It is curious that only within the past
> couple of months the device ABI changed by adding the uncached BAR, so
> this hasn't felt like a firm design.

That is to work around an unfortunate HW defect, and is connected to
another complex discussion about how ARM memory types need to
work. Originally people hoped this could simply work transparently, but
it eventually became clear it was not possible for the VM to degrade
from cacheable without VMM help. Thus a mess was born.

> Also I believe it's been stated that the driver supports both the
> bare metal representation of the device and this model where the
> coherent memory is mapped as a BAR, so I'm not sure what obstacles
> remain or how we're positioned for future products if we take the bare
> metal approach.

It could work, but it is not really the direction that was decided on
for the vPCI functions.

> > I thought all the meaningful differences are fixed now?
> >
> > The main remaining issue seems to be around the config space
> > emulation?
>
> In the development of the virtio-vfio-pci variant driver we noted that
> r/w access to the IO BAR didn't honor the IO bit in the command
> register, which was quickly remedied and now returns -EIO if accessed
> while disabled. We were already adding r/w support to the coherent BAR
> at the time as vfio doesn't have a means to express a region as only
> having mmap support and precedent exists that BAR regions must support
> these accesses. So it was suggested that r/w accesses should also
> honor the command register memory enable bit, but of course memory BARs
> also support mmap, which snowballs into a much more complicated problem
> than we have in the case of the virtio IO BAR.

I think that has just become too pedantic. Accessing the regions with
the enable bits turned off is broadly undefined behavior. So long as
the platform doesn't crash, I think it is fine to behave in a simple
way.

There is no use case for providing stronger emulation of this.

> So where do we go? Do we continue down the path of emulating full PCI
> semantics relative to these emulated BARs? Does hardware take into
> account the memory enable bit of the command register? Do we
> re-evaluate the BAR model for a device specific region?

It has to come out as a BAR on the VM side, so none of this can be
avoided. The simple answer is we don't need to care about enables
because there is no reason to care. VMs don't write to memory with the
enable turned off, because on some platforms doing so will crash the
system.

> I'd suggest we take a look at whether we need to continue to pursue
> honoring the memory enable bit for these BARs and make a conscious and
> documented decision if we choose to ignore it.

Let's document it.

> Ideally we could also make this shared spec that we're implementing
> available to the community to justify the design decisions here.
> In the case of GPUDirect Cliques we had permission to post the spec
> to the list so it could be archived to provide a stable link for
> future reference. Thanks,

Ideally. I'll see that people consider it at least.

Jason