Re: [PATCH] vfio/type1: Restore mapping performance with mdev support

From: Kirti Wankhede
Date: Thu Dec 15 2016 - 12:58:09 EST

On 12/15/2016 1:33 PM, Alex Williamson wrote:
> On Thu, 15 Dec 2016 12:05:35 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>
>> On 12/14/2016 2:28 AM, Alex Williamson wrote:
>>> As part of the mdev support, type1 now gets a task reference per
>>> vfio_dma and uses that to get an mm reference for the task while
>>> working on accounting. That's the correct thing to do for paths
>>> where we can't rely on using current, but there are still hot paths
>>> where we can optimize because we know we're invoked by the user.
>>>
>>> Specifically, vfio_pin_pages_remote() is only called when the user
>>> does DMA mapping (vfio_dma_do_map) or if an IOMMU group is added to
>>> a container with existing mappings (vfio_iommu_replay). We can
>>> therefore use current->mm as well as rlimit() and capable() directly
>>> rather than going through the high overhead path via the stored
>>> task_struct. We also know that vfio_dma_do_unmap() is only called
>>> via user ioctl, so we can also tune that path to be more lightweight.
>>>
>>> In a synthetic guest mapping test emulating a 1TB VM backed by a
>>> single 4GB range remapped multiple times across the address space,
>>> the mdev changes to the type1 backend introduced a roughly 25% hit
>>> in runtime of this test. These changes restore it to nearly the
>>> previous performance for the interfaces exercised here,
>>> VFIO_IOMMU_MAP_DMA and release on close.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
>>> ---
>>> drivers/vfio/vfio_iommu_type1.c | 145 +++++++++++++++++++++------------------
>>> 1 file changed, 79 insertions(+), 66 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 9815e45..8dfeafb 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -103,6 +103,10 @@ struct vfio_pfn {
>>> #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \
>>> (!list_empty(&iommu->domain_list))
>>>
>>> +/* Make function bool options readable */
>>> +#define IS_CURRENT (true)
>>> +#define DO_ACCOUNTING (true)
>>> +
>>> static int put_pfn(unsigned long pfn, int prot);
>>>
>>> /*
>>> @@ -264,7 +268,8 @@ static void vfio_lock_acct_bg(struct work_struct *work)
>>> kfree(vwork);
>>> }
>>>
>>> -static void vfio_lock_acct(struct task_struct *task, long npage)
>>> +static void vfio_lock_acct(struct task_struct *task,
>>> + long npage, bool is_current)
>>> {
>>> struct vwork *vwork;
>>> struct mm_struct *mm;
>>> @@ -272,24 +277,31 @@ static void vfio_lock_acct(struct task_struct *task, long npage)
>>> if (!npage)
>>> return;
>>>
>>> - mm = get_task_mm(task);
>>> + mm = is_current ? task->mm : get_task_mm(task);
>>> if (!mm)
>>> - return; /* process exited or nothing to do */
>>> + return; /* process exited */
>>>
>>> if (down_write_trylock(&mm->mmap_sem)) {
>>> mm->locked_vm += npage;
>>> up_write(&mm->mmap_sem);
>>> - mmput(mm);
>>> + if (!is_current)
>>> + mmput(mm);
>>> return;
>>> }
>>>
>>> + if (is_current) {
>>> + mm = get_task_mm(task);
>>> + if (!mm)
>>> + return;
>>> + }
>>> +
>>> /*
>>> * Couldn't get mmap_sem lock, so must setup to update
>>> * mm->locked_vm later. If locked_vm were atomic, we
>>> * wouldn't need this silliness
>>> */
>>> vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
>>> - if (!vwork) {
>>> + if (WARN_ON(!vwork)) {
>>> mmput(mm);
>>> return;
>>> }
>>> @@ -345,13 +357,13 @@ static int put_pfn(unsigned long pfn, int prot)
>>> }
>>>
>>> static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>> - int prot, unsigned long *pfn)
>>> + int prot, unsigned long *pfn, bool is_current)
>>> {
>>> struct page *page[1];
>>> struct vm_area_struct *vma;
>>> int ret;
>>>
>>> - if (mm == current->mm) {
>>> + if (is_current) {
>>
>> With this change, if vfio_pin_page_external() gets called from QEMU
>> process context, for example in response to some BAR0 register access,
>> it will still fallback to slow path, get_user_pages_remote(). We don't
>> have to change this function. This path already takes care of taking
>> best possible path.
>>
>> That also makes me think, vfio_pin_page_external() uses task structure
>> to get mlock limit and capability. Expectation is mdev vendor driver
>> shouldn't pin all system memory, but if any mdev driver does that, then
>> that driver might see such performance impact. Should we optimize this
>> path if (dma->task == current)?
>
> Hi Kirti,
>
> I was actually trying to avoid the (task == current) test with this
> change because I wasn't sure how reliable it is. Is there a
> possibility that this test generates a false positive if current
> coincidentally matches our task and does that allow us the same
> opportunities for making use of current that we have when we know in a
> process context execution path? The above change makes this a more
> direct association. Can you show that inferring the process context is
> correct? Thanks,

We take a reference on the task structure, get_task_struct(current),
before saving it in dma->task, and release it, put_task_struct(), in
vfio_remove_dma(). That makes sure we hold a valid reference to the task
structure until we remove/free that dma structure. Why would the check
(dma->task == current) give a false positive? A vendor driver can call
vfio_pin_pages() on access to some emulated register from the same task
that mapped the dma range; in that case this check would be true.
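
To make the lifetime argument concrete, here is a minimal user-space
sketch. It is not kernel code: every model_* name below is made up for
illustration, and the usage count is a plain int rather than the
kernel's reference counting. It only shows that while the dma entry
holds its reference, the stored pointer stays valid and compares equal
to the current task only when the caller is the task that mapped. It
does not settle whether the helpers that rely on current are safe in
every such path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; these are not the kernel's types. */
struct model_task {
	int usage;			/* models the task_struct usage count */
};

struct model_dma {
	struct model_task *task;	/* models dma->task */
};

/* Models `current`: the task executing right now. */
static struct model_task *model_current;

/* Create a task that starts with its own single reference. */
static struct model_task *model_task_new(void)
{
	struct model_task *t = malloc(sizeof(*t));

	if (t)
		t->usage = 1;
	return t;
}

/* Models get_task_struct(): take a reference on a live task. */
static struct model_task *model_get_task(struct model_task *t)
{
	assert(t->usage > 0);
	t->usage++;
	return t;
}

/* Models put_task_struct(): drop a reference, free on the last one. */
static void model_put_task(struct model_task *t)
{
	if (--t->usage == 0)
		free(t);
}

/* Map path: pin the mapping task for the lifetime of the dma entry. */
static void model_dma_map(struct model_dma *dma)
{
	dma->task = model_get_task(model_current);
}

/* Remove path: drop the reference taken at map time. */
static void model_dma_remove(struct model_dma *dma)
{
	model_put_task(dma->task);
	dma->task = NULL;
}

/* The check under discussion: is the caller the task that mapped? */
static bool model_pin_is_current(const struct model_dma *dma)
{
	return dma->task == model_current;
}
```

In this model, if the mapping task drops its own reference while the dma
entry still exists, the structure is not freed, so its address cannot be
handed to a different task in the meantime.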

Thanks,
Kirti