Re: [PATCH] KVM: arm64: Correctly handle the mmio faulting

From: Gavin Shan
Date: Thu Apr 22 2021 - 19:38:55 EST


Hi Marc,

On 4/22/21 4:50 PM, Marc Zyngier wrote:
On Thu, 22 Apr 2021 03:02:00 +0100,
Gavin Shan <gshan@xxxxxxxxxx> wrote:
On 4/21/21 9:59 PM, Marc Zyngier wrote:
On Wed, 21 Apr 2021 07:17:44 +0100,
Keqian Zhu <zhukeqian1@xxxxxxxxxx> wrote:
On 2021/4/21 14:20, Gavin Shan wrote:
On 4/21/21 12:59 PM, Keqian Zhu wrote:
On 2020/10/22 0:16, Santosh Shukla wrote:
The Commit:6d674e28 introduces a notion to detect and handle the
device mapping. The commit checks for the VM_PFNMAP flag is set
in vma->flags and if set then marks force_pte to true such that
if force_pte is true then ignore the THP function check
(/transparent_hugepage_adjust()).

There could be an issue with the VM_PFNMAP flag setting and checking.
For example consider a case where the mdev vendor driver register's
the vma_fault handler named vma_mmio_fault(), which maps the
host MMIO region in-turn calls remap_pfn_range() and maps
the MMIO's vma space. Where, remap_pfn_range implicitly sets
the VM_PFNMAP flag into vma->flags.
Could you give the name of the mdev vendor driver that triggers this issue?
I failed to find one according to your description. Thanks.


I think it would be fixed in driver side to set VM_PFNMAP in
its mmap() callback (call_mmap()), like vfio PCI driver does.
It means it won't be delayed until page fault is issued and
remap_pfn_range() is called. It's determined from the beginning
that the vma associated the mdev vendor driver is serving as
PFN remapping purpose. So the vma should be populated completely,
including the VM_PFNMAP flag before it becomes visible to user
space.

Why should that be a requirement? Lazy populating of the VMA should be
perfectly acceptable if the fault can only happen on the CPU side.


It isn't a requirement and the drivers needn't follow strictly. I checked
several drivers before looking into the patch and found almost all the
drivers have VM_PFNMAP set at mmap() time. In drivers/vfio/vfio-pci.c,
there is a comment as below, but it doesn't reveal too much about why
we can't set VM_PFNMAP at fault time.

static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
{
:
/*
* See remap_pfn_range(), called from vfio_pci_fault() but we can't
* change vm_flags within the fault handler. Set them now.
*/
vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
vma->vm_ops = &vfio_pci_mmap_ops;

return 0;
}

To set these flags in advance does have advantages. For example,
VM_DONTEXPAND prevents the vma to be merged with another
one. VM_DONTDUMP make this vma isn't eligible for
coredump. Otherwise, the address space, which is associated with the
vma is accessed and unnecessary page faults are triggered on
coredump. VM_IO and VM_PFNMAP avoids to walk the page frames
associated with the vma since we don't have valid PFN in the
mapping.

But PCI clearly isn't the case we are dealing with here, and not
everything is VFIO either. I can *today* create a driver that
implements a mmap+fault handler, call mmap() on it, pass the result to
a memslot, and get to the exact same result Santosh describes.

No PCI, no VFIO, just a random driver. We are *required* to handle
that.


hmm, ok. I was thinking it's related to VFIO mdev driver when Santosh was
talking about "mdev driver". Anyway, it's always nice to support the case :)

Thanks,
Gavin