Re: [PATCH v10 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

From: Jason Gunthorpe
Date: Mon Sep 18 2023 - 13:47:27 EST


On Mon, Sep 18, 2023 at 11:19:49AM -0600, Alex Williamson wrote:

> > > And the VMM usage is self inflicted because we insist on
> > > masquerading the coherent memory as a nondescript PCI BAR rather
> > > than providing a device specific region to enlighten the VMM to this
> > > unique feature.
> >
> > I see it as two completely separate things.
> >
> > 1) VFIO and qemu creating a vPCI device. Here we don't need this
> > information.
> >
> > 2) This ACPI pxm stuff to emulate the bare metal FW.
> > Including a proposal for auto-detection what kind of bare metal FW
> > is being used.
> >
> > #2 being a poor idea doesn't translate into a problem with #1; it
> > just means more work is needed on the ACPI PXM stuff.
>
> But I don't think we've justified why it's a good idea for #1. Does
> the composed vPCI device with coherent memory masqueraded as BAR2 have
> a stand alone use case without #2?

Today there is no SW that can operate that configuration, but that is
purely a SW-in-the-VM problem.

Jonathan got it right here:

https://lore.kernel.org/all/20230915153740.00006185@xxxxxxxxxx/

If Linux in the VM wants to use certain Linux kernel APIs then the FW
must provision these empty nodes. Universally. It is a CXL problem as
well.

For instance, I could hack up Linux and force it to create extra nodes
regardless of ACPI, and then everything would be fine with #1 alone.

When/if Linux learns to dynamically create these things without relying
on FW, then we won't need #2.
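
To make the guest-side dependency concrete, here is roughly the flow a
guest driver would follow to carve its coherent memory into those
FW-provisioned empty nodes. Only a sketch: the _DSD property names, the
even chunking, and the resource label are illustrative, not lifted from
the actual series.

#include <linux/acpi.h>
#include <linux/memory_hotplug.h>
#include <linux/property.h>

/*
 * Sketch: hotplug slices of device-coherent memory into the empty
 * NUMA nodes that FW advertised. Property names are invented for
 * illustration.
 */
static int gpu_mem_online(struct device *dev, u64 base, u64 size)
{
	u32 pxm_start, pxm_count, i;
	u64 chunk;
	int rc;

	/* FW (or the VMM's ACPI tables) advertises the PXM range */
	if (device_property_read_u32(dev, "gpu-mem-pxm-start", &pxm_start) ||
	    device_property_read_u32(dev, "gpu-mem-pxm-count", &pxm_count) ||
	    !pxm_count)
		return -EINVAL;

	/* Real code must keep each slice memory-block aligned */
	chunk = size / pxm_count;
	for (i = 0; i < pxm_count; i++) {
		rc = add_memory_driver_managed(pxm_to_node(pxm_start + i),
					       base + i * chunk, chunk,
					       "System RAM (gpu)", MHP_NONE);
		if (rc)
			return rc;
	}
	return 0;
}

If Linux could create the nodes itself, the properties above would be
the only FW dependency left.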

It is ugly, it is a hack, but it is copying what real FW decided to do.

> My understanding based on these series is that the guest driver somehow
> carves up the coherent memory among a set of memory-less NUMA nodes
> (how to know how many?) created by the VMM and reported via the _DSD for
> the device. If this sort of configuration is a requirement for making
> use of the coherent memory, then what exactly becomes easier by the fact
> that it's exposed as a PCI BAR?

It is keeping two concerns separate. The vPCI layer doesn't care about
any of this because it is a Linux problem. A coherent BAR is fine and
results in the least amount of special code everywhere.
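
You can see that from the VMM side: discovering and mapping the
coherent memory is the same sequence as for any other mmap-capable BAR.
A minimal userspace sketch against the existing UAPI (error handling
trimmed, device fd assumed already open):

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

static void *map_bar2(int device_fd)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_BAR2_REGION_INDEX,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return NULL;
	if (!(info.flags & VFIO_REGION_INFO_FLAG_MMAP))
		return NULL;

	/* The coherent memory maps exactly like any other BAR */
	return mmap(NULL, info.size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, device_fd, info.offset);
}

No device specific region, no special casing in the VMM's mapping
path.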

The ACPI layer has to learn how to make this hack to support Linux.

I don't think we should dramatically warp the modeling of the VFIO
regions just to support auto detecting an ACPI hack.

> In fact, if it weren't a BAR I'd probably suggest that the whole
> configuration of this device should be centered around a new
> nvidia-gpu-mem object. That object could reference the ID of a
> vfio-pci device providing the coherent memory via a device specific
> region and be provided with a range of memory-less nodes created for
> its use. The object would insert the coherent memory range into the VM
> address space and provide the device properties to make use of it in
> the same way as done on bare metal.

How does that give auto configuration? The other thread mentions that
many other things need this too, like CXL and imagined coherent virtio
stuff.

Can we do the API you imagine more generically with any VFIO region
(even a normal BAR) providing the memory object?

> It seems to me that the PCI BAR representation of coherent memory is
> largely just a shortcut to getting it into the VM address space, but
> it's also leading us down these paths where the "pxm stuff" is invoked
> based on the device attached to the VM, which is getting a lot of
> resistance.

I don't like the idea of a dedicated memory region type; I think we
will have more of these than just one.

Some kind of flag on the vfio device indicating that PXM nodes (and
how many) should be auto created would be fine.
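
Purely as illustration, such a flag could ride the existing capability
chain on VFIO_DEVICE_GET_INFO. Nothing below exists in the UAPI today;
the ID and field names are invented:

#include <linux/types.h>
#include <linux/vfio.h>

/* Hypothetical capability ID -- not part of the VFIO UAPI today */
#define VFIO_DEVICE_INFO_CAP_PXM_NODES	99

/*
 * Hypothetical: reported through the VFIO_DEVICE_GET_INFO capability
 * chain, telling the VMM to create nr_nodes empty PXM nodes for this
 * device's coherent memory.
 */
struct vfio_device_info_cap_pxm_nodes {
	struct vfio_info_cap_header header;
	__u32 nr_nodes;
	__u32 reserved;
};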

But if there is resistance to auto configuration, I don't see how that
goes away just because we shift around the indicator that triggers
auto configuration?

Jason