Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem

From: Andres Lagar-Cavilla
Date: Thu Sep 25 2014 - 17:50:35 EST


On Thu, Sep 25, 2014 at 2:16 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> Hi Andres,
>
> On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:
>> + if (!locked) {
>> + VM_BUG_ON(npages != -EBUSY);
>> +
>
> Shouldn't this be VM_BUG_ON(npages)?

Oh shoot you're right. I was confused by the introduction of -EBUSY in
the forward port.

if (ret & VM_FAULT_RETRY) {
if (nonblocking)
*nonblocking = 0;
return -EBUSY;
}

(gaaah!!!)

>
> Alternatively we could patch gup to do:
>
> case -EHWPOISON:
> + case -EBUSY:
> return i ? i : ret;
> - case -EBUSY:
> - return i;
>

No preference. Not a lot of semantics available given that we pass 1
as the count to gup. Want to cut the patch or I can just shoot one
right away?

> I need to fix gup_fast slow path to start with FAULT_FLAG_ALLOW_RETRY
> similarly to what you did to the KVM slow path.
>
> gup_fast is called without the mmap_sem (incidentally its whole point
> is to only disable irqs and not take the locks) so the enabling of
> FAULT_FLAG_ALLOW_RETRY initial pass inside gup_fast should be all self
> contained. It shouldn't concern KVM which should be already fine with
> your patch, but it will allow the userfaultfd to intercept all
> O_DIRECT gup_fast in addition to KVM with your patch.
>
> Eventually get_user_pages should be obsoleted in favor of
> get_user_pages_locked (or whoever we decide to call it) so the
> userfaultfd can intercept all kind of gups. gup_locked is same as gup
> except for one more "locked" parameter at the end, I called the
> parameter locked instead of nonblocking because it'd be more proper to
> call "nonblocking" gup the FOLL_NOWAIT kind which is quite the
> opposite (in fact as the mmap_sem cannot be dropped in the non
> blocking version).
>

It's nearly impossible to name it right because 1) it indicates we can
relinquish 2) it returns whether we still hold the mmap semaphore.

I'd prefer it'd be called mmap_sem_hold, which conveys immediately
what this is about ("nonblocking" or "locked" could be about a whole
lot of things)

> ptrace ironically is better off sticking with a NULL locked parameter
> and to get a sigbus instead of risking hanging on the userfaultfd
> (which would be safe as it can be killed, but it'd be annoying if
> erroneously poking into a hole during a gdb session). It's still
> possible to pass NULL as parameter to get_user_pages_locked to achieve
> that. So the fact some callers won't block in handle_userfault because
> FAULT_FLAG_ALLOW_RETRY is not set and the userfault cannot block, may
> come handy.
>
> What I'm trying to solve in this context is that the userfault cannot
> safely block without FAULT_FLAG_ALLOW_RETRY because we can't allow
> userland to take the mmap_sem for an unlimited amount of time without
> requiring special privileges, so if handle_userfault wants to blocks
> within a gup invocation, it must first release the mmap_sem hence
> FAULT_FLAG_ALLOW_RETRY is always required at the first attempt for any
> virtual address.

I can see that. My background for coming into this is very similar: in
a previous life we had a file system shim that would kick up into
userspace for servicing VM memory. KVM just wouldn't let the file
system give up the mmap semaphore. We had /proc readers hanging up all
over the place while userspace was servicing. Not happy.

With KVM (now) and the standard x86 fault giving you ALLOW_RETRY, what
stands in your way? Methinks that gup_fast has no slowpath fallback
that turns on ALLOW_RETRY. What would oppose that being the global
behavior?

>
> With regard to the last sentence, there's actually a race with
> MADV_DONTNEED too, I'd need to change the code to always pass
> FAULT_FLAG_ALLOW_RETRY (your code also would need to loop and
> insisting with the __get_user_pages(locked) version to solve it). The
> KVM ioctl worst case would get an -EFAULT if the race materializes for
> example. It's non concerning though because that can be solved in
> userland somehow by separating ballooning and live migration
> activities.

Well, IIUC every code path that has ALLOW_RETRY dives in the second
time with FAULT_TRIED or similar. In the common case, you happily
blaze through the second time, but if someone raced in while all locks
were given up, one pays the price of the second time being a full
fault hogging the mmap sem. At some point you need to not keep being
polite otherwise the task starves. Presumably the risk of an extra
retry drops steeply every new gup retry. Maybe just try three times is
good enough. It makes for ugly logic though.

Thanks
Andres

>
> Thanks,
> Andrea



--
Andres Lagar-Cavilla | Google Kernel Team | andreslc@xxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/