Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cachecoherency

From: Alexey Kardashevskiy
Date: Thu Sep 26 2013 - 00:31:39 EST


On 09/26/2013 06:19 AM, Alex Williamson wrote:
> On Sun, 2013-09-15 at 22:40 +1000, Alexey Kardashevskiy wrote:
>> On 09/14/2013 02:25 AM, Alex Williamson wrote:
>>> On Fri, 2013-09-13 at 18:49 +1000, Alexey Kardashevskiy wrote:
>>>> On 09/13/2013 07:23 AM, Alex Williamson wrote:
>>>>> So far we've succeeded at making KVM and VFIO mostly unaware of each
>>>>> other, but there's any important point where that breaks down. Intel
>>>>> VT-d hardware may or may not support snoop control. When snoop
>>>>> control is available, intel-iommu promotes No-Snoop transactions on
>>>>> PCIe to be cache coherent. That allows KVM to handle things like the
>>>>> x86 WBINVD opcode as a nop. When the hardware does not support this,
>>>>> KVM must implement a hardware visible WBINVD for the guest.
>>>>>
>>>>> We could simply let userspace tell KVM how to handle WBINVD, but it's
>>>>> privileged for a reason. Allowing an arbitrary user to enable
>>>>> physical WBINVD gives them a more access to the hardware. Previously,
>>>>> this has only been enabled for guests supporting legacy PCI device
>>>>> assignment. In such cases it's necessary for proper guest execution.
>>>>> We therefore create a new KVM-VFIO virtual device. The user can add
>>>>> and remove VFIO groups to this device via file descriptors. KVM
>>>>> makes use of the VFIO external user interface to validate that the
>>>>> user has access to physical hardware and gets the coherency state of
>>>>> the IOMMU from VFIO. This provides equivalent functionality to
>>>>> legacy KVM assignment, while keeping (nearly) all the bits isolated.
>>>>>
>>>>> The one intrusion is the resulting flag indicating the coherency
>>>>> state. For this RFC it's placed on the x86 kvm_arch struct, however
>>>>> I know POWER has interest in using the VFIO external user interface,
>>>>> and I'm hoping we can share a common KVM-VFIO device. Perhaps they
>>>>> care about No-Snoop handling as well or the code can be #ifdef'd.
>>>>
>>>>
>>>> POWER does not support (at least boos3s - "server", not sure about others)
>>>> this cache-non-coherent stuff at all.
>>>
>>> Then it's easy for your IOMMU API interface to return always cache
>>> coherent or never cache coherent or whatever ;)
>>>
>>>> Regarding reusing this device with external API for POWER - I posted a
>>>> patch which introduces KVM device to link KVM with IOMMU but besides the
>>>> list of groups registered in KVM, it also provides the way to find a group
>>>> by LIOBN (logical bus number) which is used in DMA map/unmap hypercalls. So
>>>> in my case kvm_vfio_group struct needs LIOBN and it would be nice to have
>>>> there window_size too (for a quick boundary check). I am not sure we want
>>>> to mix everything here.
>>>>
>>>> It is in "[PATCH v10 12/13] KVM: PPC: Add support for IOMMU in-kernel
>>>> handling" if you are interested (kvmppc_spapr_tce_iommu_device).
>>>
>>> Yes, I stole the code to get the vfio symbols from your code. The
>>> convergence I was hoping to achieve is that KVM doesn't really want to
>>> know about VFIO and vica versa. We can therefore at least limit the
>>> intrusion by sharing a common device. Obviously for you it will need
>>> some extra interfaces to associate an LIOBN to a group, but we keep both
>>> the kernel an userspace cleaner by avoiding duplication where we can.
>>> Is this really not extensible to your usage?
>>
>> Well, I do not have a good taste for this kind of stuff so I cannot tell
>> for sure. I can reuse this device and hack it to do whatever I need but how?
>>
>> There are 2 things I need from KVM device:
>> 1. associate IOMMU group with LIOBN
>
> Does QEMU know this? We could set another attribute to specify the
> LIOBN for a group.

QEMU knows as QEMU decides on LOIBN number. And yes, we could do that.


>> 2. get a pointer to an IOMMU group by LIOBN (used to get ppc's IOMMU table
>> pointer which is an IOMMU data of an IOMMU group so we could take a
>> shortcut here).
>
> Once we have a VFIO device with a VFIO group added and the LIOBN
> attribute set, isn't this just a matter of some access code?


The lookup function will be called from what we call a realmode on PPC64,
i.e. when MMU is off and kvm.ko code is not available. So we will either
have to put this lookup function to a separate virt/kvm/vfio_rm.c or
compile the whole thing into the kernel image but this is not a big issue
anyway.

You can have a look for example at book3s_64_vio_hv.o vs. book3s_64_vio.o
to get a better picture if you like.


>> There are 3 possible interfaces to use:
>> A. set/get attributes
>> B. ioctl
>> C. additional API
>
> I think we need to differentiate user interfaces from kernel interfaces.
> Within the kernel, we make no guarantees about interfaces and we can
> change them to meet our needs. That includes the vfio external user
> interface. For userspace, we need to be careful not to break things. I
> suggest we use the set/get/has attribute interface that's already part
> of KVM for the user interface.
>
>> What I posted a week ago uses A for 1 and C for 2. Now we move this to
>> virt/kvm/vfio.c.
>
> I don't care where it lives, other than we both have a need for it, so
> it seems like the core of it should not live in architecture specific
> directories.
>
>> A for 1 is more or less ok but how exactly? Yet another attribute? Platform
>> specific "bus ID"? In your patch attr->addr is not really an address (to be
>> accessed with get_user()) but just an fd.
>
> Is that a problem?

Not for me but I have a bad taste :)


>> For 2 - there are already A and B interfaces so we do not want C, right?
>> What kind of a good device has a backdoor? :) But A and B do not have
>> access control to prevent the user space from receiving a IOMMU group
>> pointer (which it would not be able to use anyway but still). Do we care
>> about this (I do not)? And using B (ioctls) within the kernel - I cannot
>> say I saw many usages of this.
>
> For kernel internal things we don't want to invent new ioctls or use the
> KVM device get_attr interface. Note that I didn't provide a get_attr
> interface for anything there now. I think we just want to create a
> kernel internal interface, ie. function calls.
>
>> I am pretty sure we will spend some time (not much) arguing about these
>> things and when we agree on something, then some of KVM maintainers will
>> step in and say that there is 1001st way of doing this and - goto start.
>> And after all, this still won't be a device as it does not emulate anything
>> present in the real hardware, this is just yet another interface to KVM and
>> some ways of using it won't be natural for somebody. /me sighs.
>
> And thus this RFC with somewhat rough code to try to get buyin. Thanks,


Well, I do not see any big issue (including no-mmu real mode) with the VFIO
KVM device as you posted it.


--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/