Re: [PATCH v2 4/4] vfio: vfio_iommu_type1: implement VFIO_IOMMU_INFO_CAPABILITIES

From: Alex Williamson
Date: Tue May 21 2019 - 11:02:12 EST


On Tue, 21 May 2019 11:14:38 +0200
Pierre Morel <pmorel@xxxxxxxxxxxxx> wrote:

> On 20/05/2019 20:23, Alex Williamson wrote:
> > On Mon, 20 May 2019 18:31:08 +0200
> > Pierre Morel <pmorel@xxxxxxxxxxxxx> wrote:
> >
> >> On 20/05/2019 16:27, Cornelia Huck wrote:
> >>> On Mon, 20 May 2019 13:19:23 +0200
> >>> Pierre Morel <pmorel@xxxxxxxxxxxxx> wrote:
> >>>
> >>>> On 17/05/2019 20:04, Pierre Morel wrote:
> >>>>> On 17/05/2019 18:41, Alex Williamson wrote:
> >>>>>> On Fri, 17 May 2019 18:16:50 +0200
> >>>>>> Pierre Morel <pmorel@xxxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>>> We implement the capability interface for VFIO_IOMMU_GET_INFO.
> >>>>>>>
> >>>>>>> When calling the ioctl, the user must specify
> >>>>>>> VFIO_IOMMU_INFO_CAPABILITIES to retrieve the capabilities and
> >>>>>>> must check in the answer if capabilities are supported.
> >>>>>>>
> >>>>>>> The iommu get_attr callback will be used to retrieve the specific
> >>>>>>> attributes and fill the capabilities.
> >>>>>>>
> >>>>>>> Currently two Z-PCI specific capabilities will be queried and
> >>>>>>> filled by the underlying Z specific s390_iommu:
> >>>>>>> VFIO_IOMMU_INFO_CAP_QFN for the PCI query function attributes
> >>>>>>> and
> >>>>>>> VFIO_IOMMU_INFO_CAP_QGRP for the PCI query function group.
> >>>>>>>
> >>>>>>> Other architectures may add new capabilities in the same way
> >>>>>>> after enhancing the architecture specific IOMMU driver.
> >>>>>>>
> >>>>>>> Signed-off-by: Pierre Morel <pmorel@xxxxxxxxxxxxx>
> >>>>>>> ---
> >>>>>>>  drivers/vfio/vfio_iommu_type1.c | 122
> >>>>>>> +++++++++++++++++++++++++++++++++++++++-
> >>>>>>>  1 file changed, 121 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c
> >>>>>>> b/drivers/vfio/vfio_iommu_type1.c
> >>>>>>> index d0f731c..9435647 100644
> >>>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>>>>>> @@ -1658,6 +1658,97 @@ static int
> >>>>>>> vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
> >>>>>>>       return ret;
> >>>>>>>   }
> >>>>>>> +static int vfio_iommu_type1_zpci_fn(struct iommu_domain *domain,
> >>>>>>> +                    struct vfio_info_cap *caps, size_t size)
> >>>>>>> +{
> >>>>>>> +    struct vfio_iommu_type1_info_pcifn *info_fn;
> >>>>>>> +    int ret;
> >>>>>>> +
> >>>>>>> +    info_fn = kzalloc(size, GFP_KERNEL);
> >>>>>>> +    if (!info_fn)
> >>>>>>> +        return -ENOMEM;
> >>>>>>> +
> >>>>>>> +    ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_ZPCI_FN,
> >>>>>>> +                    &info_fn->response);
> >>>>>>
> >>>>>> What ensures that the 'struct clp_rsp_query_pci' returned from this
> >>>>>> get_attr remains consistent with a 'struct vfio_iommu_pci_function'?
> >>>>>> Why does the latter contain so many reserved fields (beyond simply
> >>>>>> alignment) for a user API? What fields of these structures are
> >>>>>> actually useful to userspace? Should any fields not be exposed to the
> >>>>>> user? Aren't BAR sizes redundant to what's available through the vfio
> >>>>>> PCI API? I'm afraid that simply redefining an internal structure as
> >>>>>> the API leaves a lot to be desired too. Thanks,
> >>>>>>
> >>>>>> Alex
> >>>>>>
> >>>>> Hi Alex,
> >>>>>
> >>>>> I simply used the structure returned by the firmware to be sure to be
> >>>>> consistent with future evolutions and facilitate the copy from CLP and
> >>>>> to userland.
> >>>>>
> >>>>> If you prefer, and I understand that this is the case, I can define a
> >>>>> specific VFIO_IOMMU structure with only the fields relevant to the user,
> >>>>> leaving future enhancements of the user interface to be implemented in
> >>>>> another kernel patch when the time comes.
> >
> > TBH, I had no idea that CLP is an s390 firmware interface and this is
> > just dumping that to userspace. The cover letter says:
> >
> > Using the PCI VFIO interface allows userland, a.k.a. QEMU, to
> > retrieve ZPCI specific information without knowing Z specific
> > identifiers like the function ID or the function handle of the zPCI
> > function hidden behind the PCI interface.
> >
> > But what does this allow userland to do and what specific pieces of
> > information do they need? We do have a case already where Intel
> > graphics devices have a table (OpRegion) living in host system memory
> > that we expose via a vfio region, so it wouldn't be unprecedented to do
> > something like this, but as Connie suggests, if we knew what was being
> > consumed here and why, maybe we could generalize it into something
> > useful for others.
>
> OK, sorry, I will try to explain better.
>
> 1) A short description of zPCI functions and groups
>
> On Z, PCI cards live behind an adapter between subchannels and PCI.
> We access PCI cards in two ways:
> - dedicated PCI instructions (pci_load/pci_store/pci_store_block)
> - DMA
> We receive events through
> - Adapter interrupts
> - CHSC events
>
> The adapter provides an IOMMU to protect the DMA,
> and interrupt handling goes through an MSI-X-like interface handled by
> the adapter.
>
> The architecture-specific PCI code is the interface between the standard
> PCI level and the zPCI function (PCI + DMA/IOMMU/Interrupt).
>
> To handle communication the "zPCI way", the CLP interface
> provides instructions to retrieve information from the adapters.
>
> There are different groups of functions having the same functionality.
>
> clp_list gives us a list of zPCI functions
> clp_query_pci_function returns information specific to a function
> clp_query_group returns information on a function group
>
>
> 2) Why do we need it in the guest
>
> We need to provide the guest with information on the adapters and zPCI
> functions returned by the clp_query instruction so that the guest's
> driver gets the right information on how the path to the zPCI function
> has been built in the host.
>
>
> When a guest issues the CLP instructions we intercept the clp command in
> QEMU and we need to feed the response with the right values for the guest.
> The "right" values are not the raw CLP response values:
>
> - some identifiers must be virtualized, like UID and FID,
>
> - some values must match what the host received from the CLP response,
> like the size of the transmitted blocks, the DMA Address Space Mask,
> the number of interrupts, MSIA
>
> - some others must match what the host set up with the adapter and
> function, such as the start and end of DMA,
>
> - some must reflect what the host IOMMU driver supports (frame size),

This seems very reminiscent of virtualizing PCI config space... so why
is this being proposed as a VFIO IOMMU ioctl extension? These are all
function level characteristics, right? Should this be a capability on
the VFIO device, or perhaps a region like we used for the Intel
OpRegion (though the structure size seems more akin to a capability
here)? As I mentioned in my previous reply, tying this into the IOMMU
interface seemed to rely on (I assume) a one-to-one-to-one mapping of
PCI function to IOMMU group to IOMMU domain, but that still doesn't
necessarily lend itself to using the IOMMU for device level
information. If there is IOMMU info, perhaps it needs to be split, i.e.
expose a frame size via domain_get_attr, expose device level features
via a device capability, and let QEMU assemble these into something
coherent to emulate the clp interface.
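
To make the device-capability half of that split a bit more concrete,
here is a rough userspace sketch. It assumes the device info would
carry a capability chain laid out like the one region info already
uses: each entry starts with an id/version/next header, next is an
offset from the start of the info buffer, and 0 ends the chain. The
capability ID, the zpci_fn_cap_sketch layout, and find_cap() are
hypothetical names made up for illustration, not a proposed API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct vfio_info_cap_header in <linux/vfio.h>. */
struct cap_header {
	uint16_t id;		/* identifies the capability */
	uint16_t version;	/* version specific to the capability ID */
	uint32_t next;		/* offset of next capability, 0 ends the chain */
};

/* Hypothetical capability ID, for illustration only. */
#define ZPCI_FN_CAP_ID_SKETCH 0x7001

/* Hypothetical payload: only fields userspace actually consumes. */
struct zpci_fn_cap_sketch {
	struct cap_header header;
	uint64_t dma_start;	/* start of the DMA range set up by the host */
	uint64_t dma_end;	/* end of the DMA range set up by the host */
	uint16_t max_msi;	/* number of interrupts */
	uint16_t reserved[3];	/* explicit padding to 8-byte alignment */
};

/*
 * Walk the chain starting at first_off and copy out_size bytes of the
 * first capability matching id into out.
 * Returns 0 on success, -1 if the capability is absent or the chain
 * is malformed (offset out of bounds, or a loop).
 */
static int find_cap(const void *buf, size_t buf_size, uint32_t first_off,
		    uint16_t id, void *out, size_t out_size)
{
	uint32_t off = first_off;
	size_t hops = 0;

	while (off != 0) {
		struct cap_header hdr;

		/* The header must lie entirely inside the buffer. */
		if (off > buf_size || buf_size - off < sizeof(hdr))
			return -1;
		/* A valid chain cannot have more entries than fit. */
		if (++hops > buf_size / sizeof(hdr))
			return -1;

		memcpy(&hdr, (const char *)buf + off, sizeof(hdr));
		if (hdr.id == id) {
			if (buf_size - off < out_size)
				return -1;
			memcpy(out, (const char *)buf + off, out_size);
			return 0;
		}
		off = hdr.next;
	}
	return -1;
}
```

QEMU would then combine something like dma_start/dma_end from here with
a frame size obtained from the IOMMU side and its own virtualized
UID/FID when it builds the emulated clp response.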

> 3) We have three different ways to get this information:
>
> The PCI Linux interface is a standard PCI interface and some Z specific
> information is available in sysfs.
> Not all the information needed to be returned inside the CLP response is
> available.
> So we cannot use the sysfs interface to get all the information.
>
> There is a CLP ioctl interface but this interface is not secure in that
> it returns the information for all adapters in the system.
>
> The VFIO interface has the advantage of pointing to a single PCI
> function, so it is more secure than the clp ioctl interface.
> Coupled with the s390_iommu we get access to the zPCI CLP instruction
> and to the values handled by the zPCI driver.
>
>
> 4) Until now we used to fill the CLP response to the guest inside QEMU
> with fixed values corresponding to the only PCI card we supported.
> To support new cards we need to get the right values from the kernel.

If it's already emulated, I much prefer figuring out how to expose the
right pieces of information via an appropriate interface to virtualize
fields that are actually necessary rather than simply providing an
interface to dump the clp info straight to userspace and pipe it to the
VM. Thanks,

Alex