Re: Using userfaultfd with KVM's async page fault handling causes processes to hung waiting for mmap_lock to be released

From: Peter Xu
Date: Thu Jul 20 2023 - 16:07:44 EST


Hello, Axel, Dimitris,

On Wed, Jul 19, 2023 at 02:54:21PM -0700, Axel Rasmussen wrote:
> On Wed, Jul 19, 2023 at 2:16 PM Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
> >
> > Thanks for the detailed report Dimitris! I've CCed the MM mailing list and some
> > folks who work on userfaultfd.
>
> Apologies, I should have quoted the original message for the others I
> added to CC: https://lore.kernel.org/lkml/79375b71-db2e-3e66-346b-254c90d915e2@xxxxxxxxxxxxxxxxx/T/#u
>
> >
> > I took a look at this today, but I haven't quite come up with a solution.
> >
> > I thought it might be as easy as changing userfaultfd_release() to set released
> > *after* taking the lock. But no such luck, the ordering is what it is to deal
> > with another subtle case:
> >
> >
> > WRITE_ONCE(ctx->released, true);
> >
> > if (!mmget_not_zero(mm))
> > goto wakeup;
> >
> > /*
> > * Flush page faults out of all CPUs. NOTE: all page faults
> > * must be retried without returning VM_FAULT_SIGBUS if
> > * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
> > * changes while handle_userfault released the mmap_lock. So
> > * it's critical that released is set to true (above), before
> > * taking the mmap_lock for writing.
> > */
> > mmap_write_lock(mm);
> >
> > I think perhaps the right thing to do is to have handle_userfault() release
> > mmap_lock when it returns VM_FAULT_NOPAGE, and to have GUP deal with that
> > appropriately? But, some investigation is required to be sure that's okay to do
> > in the other non-GUP ways we can end up in handle_userfault().

Heh, this is also what I thought after reading. :)

If we see in the very early commit from Andrea it seems that would not hang
gup but just sigbus-ing it (see the comment that's mostly exactly the thing
Dimitris hit here):

commit 86039bd3b4e6a1129318cbfed4e0a6e001656635
Author: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Date: Fri Sep 4 15:46:31 2015 -0700

userfaultfd: add new syscall to provide memory externalization

+ /*
+ * If it's already released don't get it. This avoids to loop
+ * in __get_user_pages if userfaultfd_release waits on the
+ * caller of handle_userfault to release the mmap_sem.
+ */
+ if (unlikely(ACCESS_ONCE(ctx->released)))
+ return VM_FAULT_SIGBUS;
+

Then we switched over to the friendly way, assuming CRIU could close() the
uffd during the monitee process running, in:

commit 656710a60e3693911bee3a355d2f2bbae3faba33
Author: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Date: Fri Sep 8 16:12:42 2017 -0700

userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS

I had a feeling that after that we didn't test gup (I assume normal page
fault path will still work). Let me copy Mike too for that just in case he
has anything to say. Paste thread again:

https://lore.kernel.org/lkml/79375b71-db2e-3e66-346b-254c90d915e2@xxxxxxxxxxxxxxxxx/T/#u

My understanding is that releasing mmap lock here should work, but we need
to move the code a bit. Dimitris, please feel free to try the patch
attached here if you want. It's probably not a major use case of uffd over
kvm (IIUC unregister before close() will also work?), but if it's trivial
to fix we should proably fix it.

Thanks,

===8<===