Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

From: Alex Williamson
Date: Tue Jan 02 2024 - 11:10:13 EST


On Thu, 21 Dec 2023 12:43:28 +0000
Ankit Agrawal <ankita@xxxxxxxxxx> wrote:

> Thanks Alex and Cedric for the review.
>
> >> +/*
> >> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> >> + */
> >> +
> >> +#include "nvgrace_gpu_vfio_pci.h"
> >
> > drivers/vfio/pci/nvgrace-gpu/main.c:6:10: fatal error: nvgrace_gpu_vfio_pci.h: No such file or directory
> >    6 | #include "nvgrace_gpu_vfio_pci.h"
>
> Yeah, managed to miss the file. Will fix that in the next version.
>
> >> +
> >> +static bool nvgrace_gpu_vfio_pci_is_fake_bar(int index)
> >> +{
> >> +     return (index == RESMEM_REGION_INDEX ||
> >> +         index == USEMEM_REGION_INDEX);
> >> +}
> >
> > Presumably these macros are defined in the missing header, though we
> > don't really need a header file just for that.  This doesn't need to be
> > line wrapped, it's short enough with the macros as is.
>
> Yeah that and the structures are moved to the header file.
>
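For illustration, a minimal userspace sketch of what the check looks like
once the macros are in scope; the index values (BARs 2 and 4) are
assumptions drawn from this thread, not the actual header contents:

```c
#include <stdbool.h>

/*
 * Hypothetical stand-ins for the macros from the (missing) header.
 * The driver is described as masking BARs 2 & 4, so the usable
 * (coherent) memory and the reserved memory are assumed to sit at
 * those region indices.
 */
#define USEMEM_REGION_INDEX 2	/* VFIO_PCI_BAR2_REGION_INDEX */
#define RESMEM_REGION_INDEX 4	/* VFIO_PCI_BAR4_REGION_INDEX */

/* With the macros visible, the check fits without line wrapping. */
static bool nvgrace_gpu_vfio_pci_is_fake_bar(int index)
{
	return index == RESMEM_REGION_INDEX || index == USEMEM_REGION_INDEX;
}
```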
> >> +     info.flags = VFIO_REGION_INFO_FLAG_READ |
> >> +             VFIO_REGION_INFO_FLAG_WRITE |
> >> +             VFIO_REGION_INFO_FLAG_MMAP;
> >
> > Align these all:
> >
> >        info.flags = VFIO_REGION_INFO_FLAG_READ |
> >                     VFIO_REGION_INFO_FLAG_WRITE |
> >                     VFIO_REGION_INFO_FLAG_MMAP;
>
> Ack.
>
> >> +
> >> +static bool range_intersect_range(loff_t range1_start, size_t count1,
> >> +                               loff_t range2_start, size_t count2,
> >> +                               loff_t *start_offset,
> >> +                               size_t *intersect_count,
> >> +                               size_t *register_offset)
> >
> > We should put this somewhere shared with virtio-vfio-pci.
>
> Yeah, will move to vfio_pci_core.c
>
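As a reference point, a standalone userspace model of the helper,
mirroring the virtio-vfio-pci logic; "long long" stands in for the
kernel's loff_t and a local MIN for min_t:

```c
#include <stdbool.h>
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * range1 is the user access (pos/count into config space), range2 is
 * the emulated register.  On overlap, report where in the user buffer
 * the overlap starts, how many bytes overlap, and the byte offset
 * into the register.
 */
static bool range_intersect_range(long long range1_start, size_t count1,
				  long long range2_start, size_t count2,
				  long long *start_offset,
				  size_t *intersect_count,
				  size_t *register_offset)
{
	if (range1_start <= range2_start &&
	    range2_start < range1_start + (long long)count1) {
		/* Access begins at or before the register. */
		*start_offset = range2_start - range1_start;
		*intersect_count = MIN(count2,
				       (size_t)(range1_start + (long long)count1 -
						range2_start));
		*register_offset = 0;
		return true;
	}

	if (range1_start > range2_start &&
	    range1_start < range2_start + (long long)count2) {
		/* Access begins inside the register. */
		*start_offset = 0;
		*intersect_count = MIN(count1,
				       (size_t)(range2_start + (long long)count2 -
						range1_start));
		*register_offset = range1_start - range2_start;
		return true;
	}

	return false;
}
```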
> >> +
> >> +     if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
> >> +                               &copy_offset, &copy_count, &register_offset)) {
> >> +             bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
> >> +             nvdev->resmem.u64_reg &= ~(bar_size - 1);
> >> +             nvdev->resmem.u64_reg |= PCI_BASE_ADDRESS_MEM_TYPE_64 |
> >> +                                      PCI_BASE_ADDRESS_MEM_PREFETCH;
> >> +             val64 = cpu_to_le64(nvdev->resmem.u64_reg);
> >
> > As suggested and implemented in virtio-vfio-pci, store the value as
> > little endian, then the write function simply becomes a
> > copy_from_user(), we only need a cpu native representation of the value
> > on read.
>
> Ack.
>
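To make the suggestion concrete, a userspace sketch of the read side
under that scheme: the stored value would be little-endian so that
writes reduce to a plain copy_from_user(), and only reads need the
CPU-native masking shown here.  The flag values match pci_regs.h and
roundup_pow_of_two is reimplemented; the function name is hypothetical:

```c
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h */
#define PCI_BASE_ADDRESS_MEM_TYPE_64	0x04
#define PCI_BASE_ADDRESS_MEM_PREFETCH	0x08

/* Userspace stand-in for the kernel's roundup_pow_of_two(). */
static uint64_t roundup_pow_of_two_ish(uint64_t n)
{
	uint64_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Value a config read of the fake 64-bit BAR should return: the
 * guest-programmed base masked to the BAR's power-of-two size
 * alignment, with the 64-bit memory type and prefetchable bits
 * forced on.
 */
static uint64_t nvgrace_fake_bar_read(uint64_t programmed, uint64_t memlength)
{
	uint64_t bar_size = roundup_pow_of_two_ish(memlength);

	programmed &= ~(bar_size - 1);
	return programmed | PCI_BASE_ADDRESS_MEM_TYPE_64 |
	       PCI_BASE_ADDRESS_MEM_PREFETCH;
}
```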
> >> +
> >> +     if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
> >> +                               &copy_offset, &copy_count, &register_offset)) {
> >> +             if (copy_from_user((void *)&val16, buf + copy_offset, copy_count))
> >> +                     return -EFAULT;
> >> +
> >> +             if (le16_to_cpu(val16) & PCI_COMMAND_MEMORY)
> >> +                     nvdev->bars_disabled = false;
> >> +             else
> >> +                     nvdev->bars_disabled = true;
> >
> > nvdev->bars_disabled = !(le16_to_cpu(val16) & PCI_COMMAND_MEMORY);
> >
> > But you're only tracking COMMAND_MEM relative to user writes, memory
> > will be disabled on reset and should not initially be enabled.
>
> I suppose you are suggesting we disable during reset and not enable until
> VM driver does so through PCI_COMMAND?

Unless we're working with a VF, MMIO space of a physical device is
governed by the memory enable bit in the command register, so how
do we justify that these manufactured BARs don't have standard physical
function semantics, ie. they're accessible regardless of the command
register? Further, if we emulate any of those semantics, we should
emulate all of those semantics, r/w as well as mapped access.
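The command-register tracking being asked for can be sketched in a few
lines of plain C; the structure and helper names here are hypothetical,
only the PCI_COMMAND_MEMORY handling reflects the discussion:

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2	/* as in pci_regs.h */

struct fake_dev {
	bool bars_disabled;
};

static void fake_dev_reset(struct fake_dev *d)
{
	/* Memory decode is off after reset until the guest enables it. */
	d->bars_disabled = true;
}

/* On a guest write to PCI_COMMAND, track the memory enable bit. */
static void fake_dev_write_command(struct fake_dev *d, uint16_t cmd)
{
	d->bars_disabled = !(cmd & PCI_COMMAND_MEMORY);
}
```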

> > But then also, isn't this really just an empty token of pretending this
> > is a PCI BAR if only the read/write and not mmap path honor the device
> > memory enable bit?  We'd need to zap and restore vmas mapping these
> > BARs if this was truly behaving as a PCI memory region.
>
> I can do that change, but for my information, is this a requirement to be
> PCI compliant?

I think it all falls into the justification of why this variant driver
chooses to expose these coherent and non-coherent ranges as PCI BARs.
A PCI BAR would follow the semantics that its accessibility is governed
by the memory enable bit. The physical platform exposes these ranges
through ACPI, maybe in part exactly because these ranges live outside
of the PCI device semantics.

> > We discussed previously that access through the coherent memory is
> > unaffected by device reset, is that also true of this non-cached BAR as
> > well?
>
> Actually, the coherent memory is not entirely unaffected during device reset.
> It becomes unavailable (and read access returns ~0) for a brief time during
> the reset. The non-cached BAR behaves in the same way, as it is just
> a carved-out part of device memory.

So the reset behavior is at least safe, but in tying them to a PCI BAR
we don't really honor the memory enable bit.

> > TBH, I'm still struggling with the justification to expose these memory
> > ranges as BAR space versus attaching them as a device specific region
> > where QEMU would map them into the VM address space and create ACPI
> > tables to describe them to reflect the same mechanism in the VM as used
> > on bare metal.  AFAICT the justification boils down to avoiding work in
> > QEMU and we're sacrificing basic PCI semantics and creating a more
> > complicated kernel driver to get there.  Let's have an on-list
> > discussion why this is the correct approach.
>
> Sorry it isn't clear to me how we are sacrificing PCI semantics here. What
> features are we compromising (after we fix the ones you pointed out above)?
>
> And if we managed to make these fake BARs PCI compliant, I suppose the
> primary objection is the additional code that we added to make it compliant?

Yes, it's possible to add support that these ranges honor the memory
enable bit, but it's not trivial and unfortunately even vfio-pci isn't
a great example of this. The fault handling there still has lockdep
issues and arguably has more significant mm issues. OTOH, there are no
fixed semantics or precedents for a device specific region. We could
claim that it only supports mmap and requires no special handling
around device reset or relative to the PCI command register. The
variant driver becomes a trivial implementation that masks BARs 2 & 4
and exposes the ACPI range as a device specific region with only mmap
support. QEMU can then map the device specific region into VM memory
and create an equivalent ACPI table for the guest.

I know Jason had described this device as effectively pre-CXL to
justify the coherent memory mapping, but it seems like there's still a
gap here that we can't simply hand wave that this PCI BAR follows a
different set of semantics. We don't typically endorse complexity in
the kernel only for the purpose of avoiding work in userspace. The
absolute minimum should be some justification of the design decision
and analysis relative to standard PCI behavior. Thanks,

Alex

> >> +     ret = nvgrace_gpu_map_device_mem(nvdev, index);
> >> +     if (ret)
> >> +             goto read_exit;
> >
> > We don't need a goto to simply return an error.
>
> Yes, will fix that.
>
> >> +     if (index == USEMEM_REGION_INDEX) {
> >> +             if (copy_to_user(buf, (u8 *)nvdev->usemem.bar_remap.memaddr + offset, mem_count))
> >> +                     ret = -EFAULT;
> >> +     } else {
> >> +             return do_io_rw(&nvdev->core_device, false, nvdev->resmem.bar_remap.ioaddr,
> >> +                             (char __user *) buf, offset, mem_count, 0, 0, false);
> >
> > The vfio_device_ops.read prototype defines buf as a char __user*, so
> > maybe look at why it's being passed as a void __user* rather than
> > casting.
>
> True, will fix that.
>
> >> +             /* Check if the bars are disabled, allow access otherwise */
> >> +             down_read(&nvdev->core_device.memory_lock);
> >> +             if (nvdev->bars_disabled) {
> >> +                     up_read(&nvdev->core_device.memory_lock);
> >> +                     return -EIO;
> >> +             }
> >
> > Why do we need bars_disabled here, or at all?  If we let do_io_rw()
> > test memory it would read the command register from vconfig and all of
> > this is redundant.
>
> Yes, and I will make use of the same flag to cover the
> USEMEM_REGION_INDEX cacheable device memory accesses.
>
> >> -static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
> >> -                     void __iomem *io, char __user *buf,
> >> -                     loff_t off, size_t count, size_t x_start,
> >> -                     size_t x_end, bool iswrite)
> >> +ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
> >> +               void __iomem *io, char __user *buf,
> >> +               loff_t off, size_t count, size_t x_start,
> >> +               size_t x_end, bool iswrite)
> >
> > This should be exported in a separate patch and renamed to be
> > consistent with other vfio-pci-core functions.
>
> Sure, and will rename with vfio_pci_core prefix.
>
> >> @@ -199,6 +199,7 @@ static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
> >>
> >>       return done;
> >>  }
> >> +EXPORT_SYMBOL(do_io_rw);
> >
> > NAK, _GPL.  Thanks,
>
> Yes, will make the change.
>
> > ./scripts/checkpatch.pl --strict will give you some tips on how to
> > improve the changes further.
>
> Yes, will do that.
>