Re: [PATCH 03/27] drm/i915/gvt: Incorporate KVM memslot info into check for 2MiB GTT entry

From: Sean Christopherson
Date: Thu Jan 05 2023 - 12:40:46 EST


On Thu, Jan 05, 2023, Yan Zhao wrote:
> On Tue, Jan 03, 2023 at 09:13:54PM +0000, Sean Christopherson wrote:
> > On Wed, Dec 28, 2022, Yan Zhao wrote:
> > > On Fri, Dec 23, 2022 at 12:57:15AM +0000, Sean Christopherson wrote:
> > > > Honor KVM's max allowed page size when determining whether or not a 2MiB
> > > > GTT shadow page can be created for the guest. Querying KVM's max allowed
> > > > size is somewhat odd as there's no strict requirement that KVM's memslots
> > > > and VFIO's mappings are configured with the same gfn=>hva mapping, but
> > > Without vIOMMU, VFIO's mapping is configured the same as KVM's memslots,
> > > i.e. with the same gfn=>HVA mapping.
> >
> > But that's controlled by userspace, correct?
>
> Yes, controlled by QEMU.

...

> > Strictly speaking, no. E.g. if a 2MiB region is covered with multiple memslots
> > and the memslots have different properties.
> >
> > > If for some reason KVM maps a 2MiB range in 4KiB chunks, KVMGT can still map
> > > it in the IOMMU at 2MiB granularity as long as the PFNs are contiguous and the
> > > whole range is exposed to the guest.
> >
> > I agree that practically speaking this will hold true, but if KVMGT wants to honor
> > KVM's memslots then checking that KVM allows a hugepage is correct. Hrm, but on
> > the flip side, KVMGT ignores read-only memslot flags, so KVMGT is already ignoring
> > pieces of KVM's memslots.
> KVMGT calls dma_map_page() with DMA_BIDIRECTIONAL after gvt_pin_guest_page()
> succeeds. Though for a read-only memslot DMA_TO_DEVICE should be used instead
> (see dma_info_to_prot()), gvt_pin_guest_page() requests (IOMMU_READ | IOMMU_WRITE)
> permission for each page, which effectively ensures that a pinned GFN is not in a
> read-only memslot.
> So, it should be fine.
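
(For readers following along, the shape of that path, paraphrasing from memory rather
than quoting kvmgt.c verbatim, is roughly:

        /* Pin each 4KiB page RW and require the backing pfns to be contiguous. */
        for (npage = 0; npage < total_pages; npage++) {
                dma_addr_t cur_iova = (gfn + npage) << PAGE_SHIFT;
                struct page *cur_page;

                ret = vfio_pin_pages(&vgpu->vfio_device, cur_iova, 1,
                                     IOMMU_READ | IOMMU_WRITE, &cur_page);
                ...
        }

        /* ...and the pinned range is then mapped for DMA as read+write. */
        *dma_addr = dma_map_page(dev, page, 0, size, DMA_BIDIRECTIONAL);

i.e. the RW pin is what guarantees the backing isn't read-only on the VFIO side.)
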
>
> >
> > I have no objection to KVMGT defining its ABI such that KVMGT is allowed to create
> > 2MiB so long as (a) the GFN is contiguous according to VFIO, and (b) that the entire
> > 2MiB range is exposed to the guest.
> >
> Sorry, I may not have put it clearly enough.
> For normal device pass-through via VFIO-PCI, VFIO sets up IOMMU mappings in this way:
>
> (a) fault in PFNs for a GFN range within the same memslot (VFIO saves a dma_list, which
> matches the memslot list when vIOMMU is off or not in shadow mode).
> (b) map contiguous PFNs into the IOMMU driver (honoring the RO attribute; the mapping
> can be larger than 2MiB as long as the PFNs are contiguous).
> (c) the IOMMU driver decides whether to map at 2MiB or 4KiB according to its settings.
>
> For KVMGT, gvt_dma_map_page() first calls gvt_pin_guest_page(), which
> (a) calls vfio_pin_pages() to check that each GFN is within the allowed dma_list with
> (IOMMU_READ | IOMMU_WRITE) permission, and faults in the page.
> (b) checks that the PFNs are contiguous across the 2MiB range.
>
> Though checking kvm_page_track_max_mapping_level() is also fine, it makes the DMA
> mapping size unnecessarily smaller.

Yeah, I got all that. What I'm trying to say, and why I asked about whether or
not userspace controls the mappings, is that AFAIK there is nothing in the kernel
that coordinates mappings between VFIO and KVM. So, very technically, userspace
could map a 2MiB range contiguous in VFIO but not in KVM, or RW in VFIO but RO in KVM.

I can't imagine there's a real use case for doing so, and arguably there's no
requirement that KVMGT honor KVM's memslot. But because KVMGT taps into KVM's
page-tracking, KVMGT _does_ honor KVM's memslots to some extent because KVMGT
needs to know whether or not a given GFN can be write-protected.

I'm totally fine if KVMGT's ABI is that VFIO is the source of truth for mappings
and permissions, and that the only requirement for KVM memslots is that GTT page
tables need to be visible in KVM's memslots. But if that's the ABI, then
intel_gvt_is_valid_gfn() should be probing VFIO, not KVM (see commit cc753fbe1ac4
("drm/i915/gvt: validate gfn before set shadow page entry")).

In other words, pick either VFIO or KVM. Checking that X is valid according to
KVM and then mapping X through VFIO is confusing and makes assumptions about how
userspace configures KVM and VFIO. It works because QEMU always configures KVM
and VFIO as expected, but IMO it's unnecessarily fragile and again confusing for
unaware readers because the code is technically flawed.
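
If VFIO were the source of truth, the validity check could instead be built on
vfio_pin_pages(), e.g. something along these lines (untested, purely illustrative,
and the helper name is made up; in practice the existing pin in gvt_dma_map_page()
already answers the question as a side effect):

static bool gfn_range_is_mappable(struct intel_vgpu *vgpu, unsigned long gfn,
                                  unsigned long nr_pages)
{
        unsigned long i;

        for (i = 0; i < nr_pages; i++) {
                dma_addr_t iova = (dma_addr_t)(gfn + i) << PAGE_SHIFT;
                struct page *page;

                /* An RW pin fails if the IOVA isn't mapped RW by userspace. */
                if (vfio_pin_pages(&vgpu->vfio_device, iova, 1,
                                   IOMMU_READ | IOMMU_WRITE, &page) != 1)
                        return false;

                vfio_unpin_pages(&vgpu->vfio_device, iova, 1);
        }

        return true;
}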

On a related topic, ppgtt_populate_shadow_entry() should check the validity of the
gfn. If I'm reading the code correctly, checking only in ppgtt_populate_spt() fails
to handle the case where the guest creates a bogus mapping when writing an existing
GTT PT.

Combining all my trains of thought, what about this as an end state for this series?
(completely untested at this point). Get rid of the KVM mapping size checks,
verify the validity of the entire range being mapped, and add a FIXME to complain
about using KVM instead of VFIO to determine the validity of ranges.

static bool intel_gvt_is_valid_gfn(struct intel_vgpu *vgpu, unsigned long gfn,
                                   enum intel_gvt_gtt_type type)
{
        unsigned long nr_pages;

        if (!vgpu->attached)
                return false;

        if (type == GTT_TYPE_PPGTT_PTE_64K_ENTRY)
                nr_pages = I915_GTT_PAGE_SIZE_64K >> PAGE_SHIFT;
        else if (type == GTT_TYPE_PPGTT_PTE_2M_ENTRY)
                nr_pages = I915_GTT_PAGE_SIZE_2M >> PAGE_SHIFT;
        else
                nr_pages = 1;

        /*
         * FIXME: Probe VFIO, not KVM.  VFIO is the source of truth for KVMGT
         * mappings and permissions, KVM's involvement is purely to handle
         * write-tracking of GTT page tables.
         */
        return kvm_page_track_is_contiguous_gfn_range(vgpu->vfio_device.kvm,
                                                       gfn, nr_pages);
}

static int try_map_2MB_gtt_entry(struct intel_vgpu *vgpu, unsigned long gfn,
                                 dma_addr_t *dma_addr)
{
        if (!HAS_PAGE_SIZES(vgpu->gvt->gt->i915, I915_GTT_PAGE_SIZE_2M))
                return 0;

        return intel_gvt_dma_map_guest_page(vgpu, gfn,
                                            I915_GTT_PAGE_SIZE_2M, dma_addr);
}

static int ppgtt_populate_shadow_entry(struct intel_vgpu *vgpu,
        struct intel_vgpu_ppgtt_spt *spt, unsigned long index,
        struct intel_gvt_gtt_entry *ge)
{
        const struct intel_gvt_gtt_pte_ops *pte_ops = vgpu->gvt->gtt.pte_ops;
        dma_addr_t dma_addr = vgpu->gvt->gtt.scratch_mfn << PAGE_SHIFT;
        struct intel_gvt_gtt_entry se = *ge;
        unsigned long gfn;
        int ret;

        if (!pte_ops->test_present(ge))
                goto set_shadow_entry;

        gfn = pte_ops->get_pfn(ge);
        if (!intel_gvt_is_valid_gfn(vgpu, gfn, ge->type))
                goto set_shadow_entry;

        ...

set_shadow_entry:
        pte_ops->set_pfn(&se, dma_addr >> PAGE_SHIFT);
        ppgtt_set_shadow_entry(spt, &se, index);
        return 0;
}