Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

From: Ankit Agrawal
Date: Fri Feb 09 2024 - 04:20:41 EST


Thanks Kevin for the review. Comments inline.

>>
>> Note that the usemem memory is added by the VM Nvidia device driver [5]
>> to the VM kernel as memblocks. Hence make the usable memory size
>> memblock
>> aligned.
>
> Is memblock size defined in spec or purely a guest implementation choice?

The MEMBLOCK value is a hardwired, constant ABI value between the GPU
FW and the VFIO driver.

>>
>> If the bare metal properties are not present, the driver registers the
>> vfio-pci-core function pointers.
>
> so if qemu doesn't generate such property the variant driver running
> inside guest will always go to use core functions and guest vfio userspace
> will observe both resmem and usemem bars. But then there is nothing
> in field to prohibit mapping resmem bar as cacheable.
>
> should this driver check the presence of either ACPI property or
> resmem/usemem bars to enable variant function pointers?

Maybe I am missing something here, but if the ACPI property is absent,
the real physical BARs present on the device will be exposed to the VM
by the vfio-pci-core functions. So if the variant driver is run within
the VM, it should not see the fake usemem and resmem BARs.
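
Roughly speaking, the intention is for probe to fall back to the core
ops whenever the property lookup fails; a sketch below (the two ops
structures and the property-fetch helper are placeholder names, not
necessarily what the next revision will use):

static int nvgrace_gpu_probe(struct pci_dev *pdev,
			     const struct pci_device_id *id)
{
	/* Default to plain vfio-pci-core behaviour (real BARs exposed) */
	const struct vfio_device_ops *ops = &nvgrace_gpu_vfio_pci_core_ops;
	struct nvgrace_gpu_vfio_pci_core_device *nvdev;
	u64 memphys, memlength;
	int ret;

	/* Placeholder helper wrapping the ACPI DSD property lookup */
	ret = nvgrace_gpu_fetch_memory_property(pdev, &memphys, &memlength);
	if (!ret)
		ops = &nvgrace_gpu_vfio_pci_ops;	/* fake BAR handling */

	nvdev = vfio_alloc_device(nvgrace_gpu_vfio_pci_core_device,
				  core_device.vdev, &pdev->dev, ops);
	if (IS_ERR(nvdev))
		return PTR_ERR(nvdev);

	dev_set_drvdata(&pdev->dev, &nvdev->core_device);

	if (!ret) {
		ret = nvgrace_gpu_init_nvdev_struct(pdev, nvdev,
						    memphys, memlength);
		if (ret)
			goto out_put_vdev;
	}

	ret = vfio_pci_core_register_device(&nvdev->core_device);
	if (ret)
		goto out_put_vdev;

	return 0;

out_put_vdev:
	vfio_put_device(&nvdev->core_device.vdev);
	return ret;
}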

>> +config NVGRACE_GPU_VFIO_PCI
>> +     tristate "VFIO support for the GPU in the NVIDIA Grace Hopper
>> Superchip"
>> +     depends on ARM64 || (COMPILE_TEST && 64BIT)
>> +     select VFIO_PCI_CORE
>> +     help
>> +       VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
>> +       required to assign the GPU device using KVM/qemu/etc
>
> "assign the GPU device to userspace"

Ack, will make the change.

>> +
>> +/* Memory size expected as non cached and reserved by the VM driver */
>> +#define RESMEM_SIZE 0x40000000
>> +#define MEMBLK_SIZE 0x20000000
>
> also add a comment for MEMBLK_SIZE

Yes.

>> +
>> +struct nvgrace_gpu_vfio_pci_core_device {
>
> will nvgrace refer to a non-gpu device? if not probably all prefixes with
> 'nvgrace_gpu' can be simplified to 'nvgrace'.

nvgrace_gpu here indicates the GPU associated with the Grace CPU. That
is one of the reasons behind keeping the nvgrace_gpu nomenclature.
Calling it just nvgrace may not be apt, since Grace is the CPU.

> btw following other variant drivers 'vfio' can be removed too.

Ack.

>> +
>> +/*
>> + * Both the usable (usemem) and the reserved (resmem) device memory
>> region
>> + * are 64b fake BARs in the VM. These fake BARs must respond
>
> s/VM/device/

I can make it clearer with something like "64b fake device BARs in the VM".

>> +     int ret;
>> +
>> +     ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
>> +     if (ret < 0)
>> +             return ret;
>
> here if core_read succeeds *ppos has been updated...

Thanks for pointing that out; will fix the *ppos handling in the code.
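
One way to handle it is to latch the offset before calling into the
core, e.g. (sketch only, not necessarily what the next revision will
do):

	/* Latch the copy offset before the core read advances *ppos */
	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
	ssize_t ret;

	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
	if (ret < 0)
		return ret;

	/*
	 * Any fixup of the emulated fake BAR registers must be computed
	 * from the latched 'pos', since *ppos now points past the bytes
	 * the core read just returned.
	 */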

>> + * Read the data from the device memory (mapped either through ioremap
>> + * or memremap) into the user buffer.
>> + */
>> +static int
>> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device
>> *nvdev,
>> +                      char __user *buf, size_t mem_count, loff_t *ppos)
>> +{
>> +     unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> +     u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>> +     int ret;
>> +
>> +     /*
>> +      * Handle read on the BAR regions. Map to the target device memory
>> +      * physical address and copy to the request read buffer.
>> +      */
>
> duplicate with the earlier comment for the function.

Ack.
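
For context, the read path boils down to something like the sketch
below (field names such as memaddr/ioaddr and USEMEM_REGION_INDEX are
illustrative; the resmem path here bounces through a kernel buffer
purely to keep the example short):

static int
nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
			 char __user *buf, size_t mem_count, loff_t *ppos)
{
	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
	int ret = 0;

	if (!mem_count)
		return 0;

	if (index == USEMEM_REGION_INDEX) {
		/* usemem is memremap()ed as normal cacheable memory */
		if (copy_to_user(buf, (u8 *)nvdev->usemem.memaddr + offset,
				 mem_count))
			ret = -EFAULT;
	} else {
		/* resmem is ioremap()ed; real code would chunk this */
		u8 *tmp = kzalloc(mem_count, GFP_KERNEL);

		if (!tmp)
			return -ENOMEM;
		memcpy_fromio(tmp, nvdev->resmem.ioaddr + offset, mem_count);
		if (copy_to_user(buf, tmp, mem_count))
			ret = -EFAULT;
		kfree(tmp);
	}

	return ret;
}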

>> +/*
>> + * Read count bytes from the device memory at an offset. The actual device
>> + * memory size (available) may not be a power-of-2. So the driver fakes
>> + * the size to a power-of-2 (reported) when exposing to a user space driver.
>> + *
>> + * Reads extending beyond the reported size are truncated; reads starting
>> + * beyond the reported size generate -EINVAL; reads extending beyond the
>> + * actual device size is filled with ~0.
>
> slightly clearer to order the description: read starting beyond reported
> size, then read extending beyond device size, and read extending beyond
> reported size.

Yes, will make the change.
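
In that order, the handling is essentially the following (sketch, with
'bar_size' as the reported power-of-2 size and 'memlength' as the
actual device memory size; names are illustrative):

	/* Read starting beyond the reported size */
	if (offset >= memregion->bar_size)
		return -EINVAL;

	/* Truncate reads extending beyond the reported size */
	count = min(count, (size_t)(memregion->bar_size - offset));

	/* Bytes actually backed by device memory */
	if (offset >= memregion->memlength)
		mem_count = 0;
	else
		mem_count = min(count, (size_t)(memregion->memlength - offset));

	/*
	 * mem_count bytes come from the device; the remaining
	 * (count - mem_count) bytes are filled with ~0.
	 */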

>> +      * exposing the rest (termed as usable memory and represented
>> using usemem)
>> +      * as cacheable 64b BAR (region 4 and 5).
>> +      *
>> +      *               devmem (memlength)
>> +      * |-------------------------------------------------|
>> +      * |                                           |
>> +      * usemem.phys/memphys                         resmem.phys
>
> there is no usemem.phys and resmem.phys

Silly miss. Will fix that.

>> +      */
>> +     nvdev->usemem.memphys = memphys;
>> +
>> +     /*
>> +      * The device memory exposed to the VM is added to the kernel by
>> the
>> +      * VM driver module in chunks of memory block size. Only the usable
>> +      * memory (usemem) is added to the kernel for usage by the VM
>> +      * workloads. Make the usable memory size memblock aligned.
>> +      */
>
> If memblock size is defined by hw spec then say so.
> otherwise this sounds a broken contract if it's a guest-decided value.

Answered above. Will update the comment to mention that.
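
Concretely, after the resmem carve-out the usable size gets aligned
down to the memory block size, along the lines of (sketch):

	/*
	 * MEMBLK_SIZE is the hardwired ABI value mentioned above; the VM
	 * driver adds usemem to the kernel in chunks of this size, so
	 * align the usable length down to it.
	 */
	nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
					     MEMBLK_SIZE);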

>> +     if (check_sub_overflow(memlength, RESMEM_SIZE,
>> +                            &nvdev->usemem.memlength)) {
>> +             ret = -EOVERFLOW;
>> +             goto done;
>> +     }
>
> does resmem require 1G-aligned?

No, the requirement is that it be 1G+. There are no alignment
restrictions there.

> if usemem.memlength becomes 0 then should return error too.

I suppose you mean if it becomes 0 after the subtraction above. We
otherwise check for a 0 size in the host ACPI table.
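
If an explicit in-driver check is still preferred, it could be folded
into the existing test, e.g. (sketch only):

	if (check_sub_overflow(memlength, RESMEM_SIZE,
			       &nvdev->usemem.memlength) ||
	    !nvdev->usemem.memlength) {
		ret = -EOVERFLOW;
		goto done;
	}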