[PATCH 2/2] kvm: Faults which trigger IO release the mmap_sem

From: Andres Lagar-Cavilla
Date: Fri Sep 26 2014 - 12:36:56 EST


When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.

If async PFs are enabled, the fault is retried as soon as possible from an
async workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will block
on the IO. This is a bad thing, as other mmap semaphore users now stall as
a function of swap or filemap latency.

This patch ensures that both the regular and the async PF paths re-enter
the fault, allowing the mmap semaphore to be released while waiting for IO.

Reviewed-by: Radim Krčmář <rkrcmar@xxxxxxxxxx>
Signed-off-by: Andres Lagar-Cavilla <andreslc@xxxxxxxxxx>
Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
---
 include/linux/mm.h  | 1 +
 mm/gup.c            | 4 ++++
 virt/kvm/async_pf.c | 4 +---
 virt/kvm/kvm_main.c | 4 ++--
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 69f692d..71dbe03 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1997,6 +1997,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 19e17ab..369b3f6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+	if (*flags & FOLL_TRIED) {
+		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		fault_flags |= FAULT_FLAG_TRIED;
+	}
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d6a3d09..44660ae 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
 
 	might_sleep();
 
-	down_read(&mm->mmap_sem);
-	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
 	kvm_async_page_present_sync(vcpu, apf);
 
 	spin_lock(&vcpu->async_pf.lock);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 95519bc..921bce7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1170,8 +1170,8 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 					      addr, write_fault, page);
 		up_read(&current->mm->mmap_sem);
 	} else
-		npages = get_user_pages_fast(addr, 1, write_fault,
-					     page);
+		npages = get_user_pages_unlocked(current, current->mm, addr, 1,
+						 write_fault, 0, page);
 	if (npages != 1)
 		return npages;
 


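To illustrate the two-pass scheme the changelog describes, here is a hedged
sketch of the caller side. It is illustrative only, not code from this patch:
the function name is made up, while __get_user_pages(), the FOLL_* flags and
the get_user_pages_unlocked() helper used in the diff are the real interfaces.

/*
 * Hypothetical caller of the two-pass scheme.  The first pass uses
 * FOLL_NOWAIT, so it only kicks off readahead/swapin and returns
 * immediately instead of sleeping under mmap_sem; KVM uses that window
 * to inject an async page fault and let the guest run something else.
 * The retry may sleep, but get_user_pages_unlocked() drops mmap_sem
 * while waiting and only retakes it for the FOLL_TRIED pass.
 */
static long fault_in_one_page(struct task_struct *tsk, struct mm_struct *mm,
			      unsigned long addr, int write,
			      struct page **page)
{
	long npages;
	unsigned int flags = FOLL_TOUCH | FOLL_GET | FOLL_NOWAIT;

	if (write)
		flags |= FOLL_WRITE;

	/* First pass: don't block on IO, just start it. */
	down_read(&mm->mmap_sem);
	npages = __get_user_pages(tsk, mm, addr, 1, flags, page, NULL, NULL);
	up_read(&mm->mmap_sem);
	if (npages == 1)
		return npages;

	/*
	 * Retry (inline or from the async PF workqueue): this time we
	 * wait for the IO, but without holding mmap_sem across the wait.
	 */
	return get_user_pages_unlocked(tsk, mm, addr, 1, write, 0, page);
}

Neither pass sleeps on IO while holding mmap_sem: the first pass never sleeps
at all, and the second drops the semaphore for the duration of the IO wait and
only retakes it for the FOLL_TRIED retry.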
This isn't bisectable in this order, and it's untested anyway; it needs to
be split into more patches.

This is just an initial RFC to find out whether it's ok to go in this
direction.

If it's ok, I'll do some testing and submit it more properly. If your
patches go in first, that's fine; I'll just replace the KVM call with
get_user_pages_unlocked (or whatever we end up calling it).

I'd need to get this (or an equivalent solution) merged before
re-submitting the userfaultfd patchset. I think the above benefits the
kernel as a whole in terms of mmap_sem hold times regardless of
userfaultfd, so it should be worthwhile either way.

> Well, IIUC every code path that has ALLOW_RETRY dives in the second
> time with FAULT_TRIED or similar. In the common case, you happily
> blaze through the second time, but if someone raced in while all locks
> were given up, one pays the price of the second time being a full
> fault hogging the mmap sem. At some point you need to not keep being
> polite otherwise the task starves. Presumably the risk of an extra
> retry drops steeply every new gup retry. Maybe just try three times is
> good enough. It makes for ugly logic though.

I was under the impression that if one looped forever on VM_FAULT_RETRY it
would eventually succeed, but that risks doing more work, so I'm sticking
to this: run with "locked != NULL" first, seek to the first page that
returned VM_FAULT_RETRY, issue an nr_pages=1 gup with locked == NULL, and
then continue with locked != NULL from the next page, just like you did in
the KVM slow path. And if the "pages" array is NULL I bail out at the first
VM_FAULT_RETRY failure without insisting on further gup calls of the
"&locked" kind; your patch had just one page, but it bailed out too.
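To make that concrete, here is a rough, hedged sketch of such a helper. It is
not the final code: it skips the pages == NULL early bailout described above
and glosses over error corner cases, but it shows where mmap_sem is held and
where FOLL_TRIED comes in.

/*
 * Rough sketch only, not the final implementation: a
 * get_user_pages_unlocked() style helper that never sleeps on IO while
 * holding mmap_sem.  It relies on __get_user_pages() clearing "locked"
 * (the fault path having dropped mmap_sem) whenever a fault had to wait
 * for IO.
 */
long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
			     unsigned long start, unsigned long nr_pages,
			     int write, int force, struct page **pages)
{
	long done = 0, ret = 0;
	unsigned int flags = FOLL_TOUCH;

	if (pages)
		flags |= FOLL_GET;
	if (write)
		flags |= FOLL_WRITE;
	if (force)
		flags |= FOLL_FORCE;

	down_read(&mm->mmap_sem);
	while (nr_pages) {
		int locked = 1;

		/*
		 * Fault in as much as possible; a fault that must wait
		 * for IO drops mmap_sem, clears "locked" and makes
		 * __get_user_pages() return early with a partial count.
		 */
		ret = __get_user_pages(tsk, mm, start, nr_pages, flags,
				       pages, NULL, &locked);
		if (ret > 0) {
			done += ret;
			start += ret * PAGE_SIZE;
			nr_pages -= ret;
			if (pages)
				pages += ret;
		}
		if (locked)
			break;	/* all done, or a real error */

		/*
		 * mmap_sem is no longer held.  Retake it and fault in
		 * just the page that needed IO, with FOLL_TRIED, then
		 * resume the "&locked" pass for the rest of the range.
		 */
		down_read(&mm->mmap_sem);
		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
				       pages, NULL, NULL);
		if (ret != 1)
			break;
		done++;
		start += PAGE_SIZE;
		nr_pages--;
		if (pages)
			pages++;
	}
	up_read(&mm->mmap_sem);
	return done ? done : ret;
}

The important property is the same as in the KVM paths above: any sleeping
for IO happens with mmap_sem released, and the semaphore is only retaken for
the FOLL_TRIED completion of the page that triggered the wait.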

What the code above does is basically generalize your KVM optimization and
make it global, and at the same time it saves me trouble in
handle_userfault().

While at it I also converted some obvious candidates to gup_fast that had
no reason to run slower (which I should split off into a separate patch).

Thanks,
Andrea