Re: Can VFIO pin only a specific region of guest mem when use pass through devices?

From: Simon Guo
Date: Tue Oct 30 2018 - 07:22:42 EST


On Tue, Oct 30, 2018 at 11:00:51AM +0800, Peter Xu wrote:
> On Mon, Oct 29, 2018 at 12:29:22PM -0600, Alex Williamson wrote:
> > On Mon, 29 Oct 2018 17:14:46 +0800
> > Jason Wang <jasowang@xxxxxxxxxx> wrote:
> >
> > > On 2018/10/29 10:42 AM, Simon Guo wrote:
> > > > Hi,
> > > >
> > > > I am using network device passthrough with QEMU on x86 (-device vfio-pci,host=0000:xx:yy.z)
> > > > and “intel_iommu=on” on the host kernel command line, and the whole guest memory appears
> > > > to be pinned (vfio_pin_pages()), judging by the RES column in “top”. I understand this is because
> > > > the device can DMA to any guest memory address, so that memory cannot be swapped out.
> > > >
> > > > However, can we pin just the range of address space allowed by the IOMMU group of that device,
> > > > instead of pinning the whole address space? I did notice some code like vtd_host_dma_iommu().
> > > > Maybe there is already some way to enable that?
> > > >
> > > > Sorry if I missed some basics. I searched but have not found an answer yet. Please
> > > > let me know if this has already been discussed.
> > > >
> > > > Any other suggestions would also be appreciated. For example, could we modify the guest network
> > > > card driver to allocate only from a specific memory region (zone), and have QEMU advise the guest
> > > > kernel to pin only that memory region (zone) accordingly?
> > > >
> > > > Thanks,
> > > > - Simon
> > >
> > >
> > > One possible method is to enable an IOMMU in the VM.
> >
> > Right, making use of a virtual IOMMU in the VM is really the only way
> > to bound the DMA to some subset of guest memory, but vIOMMU usage by
> > the guest is optional on x86 and even if the guest does use it, it might
> > enable passthrough mode, which puts you back at the problem that all
> > guest memory is pinned with the additional problem that it might also
> > be accounted for once per assigned device and may hit locked memory
> > limits. Also, the DMA mapping and unmapping path with a vIOMMU is very
> > slow, so performance of the device in the guest will be abysmal unless
> > the use case is limited to very static mappings, such as userspace use
> > within the guest for nested assignment or perhaps DPDK use cases.
> >
> > Modifying the guest to only use a portion of memory for DMA sounds like
> > a quite intrusive option. There are certainly IOMMU models where the
> > IOMMU provides a fixed IOVA range, but creating dynamic mappings within
> > that range doesn't really solve anything given that it simply returns
> > us to a vIOMMU with slow mapping. A window with a fixed identity
> > mapping used as a DMA zone seems plausible, but again, also pretty
> > intrusive to the guest, possibly also to the drivers. Host IOMMU page
> > faulting can also help the pinned memory footprint, but of course
> > requires hardware support and lots of new code paths, many of which are
> > already being discussed for things like Scalable IOV and SVA. Thanks,
>
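
To make the locked-memory point above concrete, here is a rough arithmetic
sketch. It assumes the worst case Alex describes: with a vIOMMU in passthrough
mode, all guest RAM is accounted once per assigned device, while without a
vIOMMU the devices share a single mapping of guest RAM. The helper names
(pinned_estimate, fits_memlock) are hypothetical, and the model is only an
estimate, not what the kernel actually computes.

```python
import resource

GIB = 1 << 30


def pinned_estimate(guest_ram_bytes, assigned_devices, viommu_passthrough):
    """Worst-case locked-memory accounting in bytes (illustrative model only)."""
    if assigned_devices == 0:
        return 0
    if viommu_passthrough:
        # Assumption: each device sits in its own address space, so the same
        # guest RAM may be accounted once per assigned device.
        return guest_ram_bytes * assigned_devices
    # Assumption: without a vIOMMU, all devices share one mapping of guest RAM.
    return guest_ram_bytes


def fits_memlock(required_bytes):
    """Compare an estimate against this process's soft RLIMIT_MEMLOCK."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY or required_bytes <= soft


if __name__ == "__main__":
    need = pinned_estimate(8 * GIB, assigned_devices=2, viommu_passthrough=True)
    print(f"estimated locked memory: {need // GIB} GiB, fits: {fits_memlock(need)}")
```

With these assumptions, an 8 GiB guest with two assigned devices would need a
locked-memory limit of about 16 GiB in the passthrough case, versus 8 GiB
without a vIOMMU.
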
> Agree with Jason's and Alex's comments. One trivial addition: the
> whole guest RAM will possibly still be pinned for a very short period
> during guest system boot (e.g., when running guest BIOS) and before
> the guest kernel enables the vIOMMU for the assigned device since the
> bootup code like BIOS would still need to be able to access the whole
> guest memory.
>
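
For reference, a minimal sketch of the vIOMMU setup suggested above, written
as a small Python helper that builds the QEMU argument list. The option
strings reflect my understanding of QEMU's emulated VT-d support (q35 machine,
split irqchip, intel-iommu with caching-mode=on for VFIO devices) and should be
checked against the QEMU version in use. The helper name and the default RAM
size are hypothetical, and the host address is the same placeholder as in my
original mail.

```python
import shlex


def qemu_viommu_args(host_bdf, guest_ram="4G"):
    """Build a QEMU argv with an emulated Intel IOMMU and one vfio-pci device."""
    return [
        "qemu-system-x86_64",
        "-machine", "q35,accel=kvm,kernel-irqchip=split",
        "-m", guest_ram,
        # The vIOMMU is listed before the assigned device. caching-mode=on
        # lets QEMU observe guest mapping changes and mirror them into VFIO.
        "-device", "intel-iommu,intremap=on,caching-mode=on",
        "-device", f"vfio-pci,host={host_bdf}",
    ]


if __name__ == "__main__":
    # Placeholder address; substitute the real host PCI address.
    print(shlex.join(qemu_viommu_args("0000:xx:yy.z")))
```

The guest would also need intel_iommu=on on its own kernel command line, and
must not use iommu=pt, otherwise we are back to the passthrough-mode case
where all guest memory is pinned.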

Peter, Alex, Jason,
Thanks for your detailed explanations.

BR,
- Simon