Re: [PATCH 03/27] drm/i915/gvt: Incorporate KVM memslot info into check for 2MiB GTT entry

From: Yan Zhao
Date: Wed Jan 04 2023 - 22:31:32 EST


On Tue, Jan 03, 2023 at 09:13:54PM +0000, Sean Christopherson wrote:
> On Wed, Dec 28, 2022, Yan Zhao wrote:
> > On Fri, Dec 23, 2022 at 12:57:15AM +0000, Sean Christopherson wrote:
> > > Honor KVM's max allowed page size when determining whether or not a 2MiB
> > > GTT shadow page can be created for the guest. Querying KVM's max allowed
> > > size is somewhat odd as there's no strict requirement that KVM's memslots
> > > and VFIO's mappings are configured with the same gfn=>hva mapping, but
> > Without vIOMMU, VFIO's mapping is configured the same as KVM's
> > memslots, i.e. with the same gfn==>HVA mapping
>
> But that's controlled by userspace, correct?

Yes, it's controlled by QEMU.
VFIO in the kernel has no idea whether vIOMMU is enabled or not.
KVMGT is currently known not to work with vIOMMU when shadow mode is on
(in this mode, VFIO maps gIOVA ==> HVA ==> HPA).

>
> > > the check will be accurate if userspace wants to have a functional guest,
> > > and at the very least checking KVM's memslots guarantees that the entire
> > > 2MiB range has been exposed to the guest.
> >
> > I think just checking that the entire 2MiB GFN range is within a KVM memslot is
> > enough.
>
> Strictly speaking, no. E.g. if a 2MiB region is covered with multiple memslots
> and the memslots have different properties.
>
> > If for some reason KVM maps a 2MiB range in 4K sizes, KVMGT can still map
> > it in the IOMMU in 2MiB size as long as the PFNs are contiguous and the
> > whole range is exposed to the guest.
>
> I agree that practically speaking this will hold true, but if KVMGT wants to honor
> KVM's memslots then checking that KVM allows a hugepage is correct. Hrm, but on
> the flip side, KVMGT ignores read-only memslot flags, so KVMGT is already ignoring
> pieces of KVM's memslots.
KVMGT calls dma_map_page() with DMA_BIDIRECTIONAL after gvt_pin_guest_page() succeeds.
For a read-only memslot, DMA_TO_DEVICE should be used instead
(see dma_info_to_prot()).
However, gvt_pin_guest_page() checks (IOMMU_READ | IOMMU_WRITE) permission for each page,
so it actually ensures that the pinned GFN is not in a read-only memslot.
So it should be fine.
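
To illustrate the permission argument, here is a small user-space model (not
kernel code). All MODEL_* names and both functions are hypothetical; they only
mirror the idea that a DMA direction maps to IOMMU read/write bits and that a
pin request must be a subset of what the region allows.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's DMA directions and IOMMU prot bits. */
enum model_dma_dir {
	MODEL_DMA_BIDIRECTIONAL,
	MODEL_DMA_TO_DEVICE,
	MODEL_DMA_FROM_DEVICE,
};

#define MODEL_IOMMU_READ	(1u << 0)
#define MODEL_IOMMU_WRITE	(1u << 1)

/* Model of translating a DMA direction into IOMMU permission bits. */
unsigned int model_dir_to_prot(enum model_dma_dir dir)
{
	switch (dir) {
	case MODEL_DMA_BIDIRECTIONAL:
		return MODEL_IOMMU_READ | MODEL_IOMMU_WRITE;
	case MODEL_DMA_TO_DEVICE:
		return MODEL_IOMMU_READ;	/* device only reads memory */
	case MODEL_DMA_FROM_DEVICE:
		return MODEL_IOMMU_WRITE;	/* device only writes memory */
	}
	return 0;
}

/* A pin succeeds only if every requested bit is allowed by the region. */
bool model_pin_allowed(unsigned int region_prot, unsigned int requested_prot)
{
	return (region_prot & requested_prot) == requested_prot;
}
```

In this model, a region that only allows MODEL_IOMMU_READ rejects a
READ | WRITE pin request, so a later bidirectional mapping is never reached
for it.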

>
> I have no objection to KVMGT defining its ABI such that KVMGT is allowed to create
> 2MiB so long as (a) the GFN is contiguous according to VFIO, and (b) that the entire
> 2MiB range is exposed to the guest.
>
Sorry, I may not have put it clearly enough.
For a normal device pass-through via VFIO-PCI, VFIO creates IOMMU mappings in this way:

(a) fault in PFNs in a GFN range within the same memslot (VFIO saves a dma_list, which is
the same as the memslot list when vIOMMU is not on or not in shadow mode).
(b) map contiguous PFNs into the IOMMU driver (honouring the ro attribute; the range can be
> 2MiB as long as the PFNs are contiguous).
(c) the IOMMU driver decides whether to map in 2MiB or in 4KiB according to its settings.

For KVMGT, gvt_dma_map_page() first calls gvt_pin_guest_page(), which
(a) calls vfio_pin_pages() to check that each GFN is within the allowed dma_list with
(IOMMU_READ | IOMMU_WRITE) permission, and to fault in the page.
(b) checks that the PFNs are contiguous across the 2MiB range.

Though checking kvm_page_track_max_mapping_level() is also fine, it makes the DMA
mapping size unnecessarily smaller.
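
As a rough user-space sketch of step (b) for KVMGT (not the actual
gvt_pin_guest_page() code), the contiguity check over a 2MiB range amounts to
the following. The helper name and MODEL_PAGES_PER_2M are made up for
illustration, and 4KiB base pages are assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4KiB base pages, so a 2MiB range spans 512 pages. */
#define MODEL_PAGES_PER_2M	512u

/*
 * Hypothetical helper: returns true if pfns[0..n-1] are physically
 * consecutive, i.e. pfns[i] == pfns[0] + i for every i.
 */
bool model_pfns_are_contiguous(const uint64_t *pfns, size_t n)
{
	size_t i;

	if (!pfns || n == 0)
		return false;

	for (i = 1; i < n; i++) {
		if (pfns[i] != pfns[0] + i)
			return false;
	}
	return true;
}
```

If a single page in the range was backed by a different physical frame, the
helper returns false and the caller would fall back to 4KiB mappings.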

> That said, being fully permissive also seems wasteful, e.g. KVM would need to
> explicitly support straddling multiple memslots.
>
> As a middle ground, what about tweaking kvm_page_track_is_valid_gfn() to take a
> range, and then checking that the range is contained in a single memslot?
>
> E.g. something like:
>
> bool kvm_page_track_is_contiguous_gfn_range(struct kvm *kvm, gfn_t gfn,
> 					    unsigned long nr_pages)
> {
> 	struct kvm_memory_slot *memslot;
> 	bool ret;
> 	int idx;
>
> 	idx = srcu_read_lock(&kvm->srcu);
> 	memslot = gfn_to_memslot(kvm, gfn);
> 	ret = kvm_is_visible_memslot(memslot) &&
> 	      gfn + nr_pages <= memslot->base_gfn + memslot->npages;
> 	srcu_read_unlock(&kvm->srcu, idx);
>
> 	return ret;
> }

Yes, it's good.
But as explained above, gvt_dma_map_page() already performs an equivalent check.
So maybe checking kvm_page_track_is_contiguous_gfn_range() is not
required either?
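
For reference, the "range is contained in a single memslot" condition can be
modelled in user space roughly as below. This is only a sketch: struct
model_memslot and both functions are hypothetical stand-ins, not KVM's types,
and locking and visibility checks are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified memslot: a GFN range [base_gfn, base_gfn + npages). */
struct model_memslot {
	uint64_t base_gfn;
	uint64_t npages;
};

/* Find the slot containing gfn, or NULL if no slot covers it. */
const struct model_memslot *
model_gfn_to_memslot(const struct model_memslot *slots, size_t nslots,
		     uint64_t gfn)
{
	size_t i;

	for (i = 0; i < nslots; i++) {
		if (gfn >= slots[i].base_gfn &&
		    gfn - slots[i].base_gfn < slots[i].npages)
			return &slots[i];
	}
	return NULL;
}

/* True if [gfn, gfn + nr_pages) lies entirely within one slot. */
bool model_range_in_single_memslot(const struct model_memslot *slots,
				   size_t nslots, uint64_t gfn,
				   uint64_t nr_pages)
{
	const struct model_memslot *slot;

	slot = model_gfn_to_memslot(slots, nslots, gfn);
	if (!slot)
		return false;

	/* Same as gfn + nr_pages <= base_gfn + npages, written to avoid overflow. */
	return nr_pages <= slot->npages - (gfn - slot->base_gfn);
}
```

With two adjacent slots covering GFNs [0, 256) and [256, 1280), a 512-page
range starting at GFN 0 straddles both slots and is rejected, while the same
range starting at GFN 256 is accepted.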
>
> > Actually normal device passthrough with VFIO-PCI also maps GFNs in a
> > similar way, i.e. it maps a guest-visible range in as large a size as
> > possible as long as the PFNs are contiguous.
> > >
> > > Note, KVM may also restrict the mapping size for reasons that aren't
> > > relevant to KVMGT, e.g. for KVM's iTLB multi-hit workaround or if the gfn
> > Will iTLB multi-hit affect DMA?
>
> I highly doubt it, I can't imagine an IOMMU would have a dedicated instruction
> TLB :-)
I can double check it with IOMMU hardware experts.
But if DMA could tamper with the instruction TLB, shouldn't it already have been
reported as an issue with normal VFIO pass-through?

> > AFAIK, IOMMU mappings currently never set the exec bit (and I'm told this bit is
> > under discussion to be removed).