Re: [PATCH v2 2/3] userfaultfd: protect mmap_changing with rw_sem in userfaultfd_ctx

From: Lokesh Gidra
Date: Mon Jan 29 2024 - 17:37:07 EST


On Mon, Jan 29, 2024 at 1:00 PM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
>
> * Lokesh Gidra <lokeshgidra@xxxxxxxxxx> [240129 14:35]:
> > Increments and loads to mmap_changing are always in mmap_lock
> > critical section.
>
> Read or write?
>
Increments happen with mmap_lock held in write-mode (except in
userfaultfd_remove(), where it's held in read-mode), and loads happen
with mmap_lock held in read-mode. I'll clarify this in the next
version.
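
For reference, here is a minimal sketch of how the two sides pair up
after this patch (pieced together from the hunks below; error paths
elided, and do_the_operation() just stands in for the actual
fill/move/wp work):

  /* Writer side: non-cooperative event. mmap_lock is already held
   * in write-mode (read-mode in the userfaultfd_remove() case). */
  down_write(&ctx->map_changing_lock);
  atomic_inc(&ctx->mmap_changing);
  up_write(&ctx->map_changing_lock);

  /* Reader side: uffd operation. mmap_lock is held in read-mode. */
  down_read(&ctx->map_changing_lock);
  if (atomic_read(&ctx->mmap_changing))
          err = -EAGAIN;          /* bail out; userspace must retry */
  else
          err = do_the_operation();       /* fill/move/wp */
  up_read(&ctx->map_changing_lock);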
>
> > This ensures that if userspace requests event
> > notification for non-cooperative operations (e.g. mremap), userfaultfd
> > operations don't occur concurrently.
> >
> > This can be achieved by using a separate read-write semaphore in
> > userfaultfd_ctx such that increments are done in write-mode and loads
> > in read-mode, thereby eliminating the dependency on mmap_lock for this
> > purpose.
> >
> > This is a preparatory step before we replace mmap_lock usage with
> > per-vma locks in fill/move ioctls.
> >
> > Signed-off-by: Lokesh Gidra <lokeshgidra@xxxxxxxxxx>
> > ---
> > fs/userfaultfd.c | 40 ++++++++++++----------
> > include/linux/userfaultfd_k.h | 31 ++++++++++--------
> > mm/userfaultfd.c | 62 ++++++++++++++++++++---------------
> > 3 files changed, 75 insertions(+), 58 deletions(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 58331b83d648..c00a021bcce4 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -685,12 +685,15 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
> > ctx->flags = octx->flags;
> > ctx->features = octx->features;
> > ctx->released = false;
> > + init_rwsem(&ctx->map_changing_lock);
> > atomic_set(&ctx->mmap_changing, 0);
> > ctx->mm = vma->vm_mm;
> > mmgrab(ctx->mm);
> >
> > userfaultfd_ctx_get(octx);
> > + down_write(&octx->map_changing_lock);
> > atomic_inc(&octx->mmap_changing);
> > + up_write(&octx->map_changing_lock);
>
> This can potentially hold up your writer as the readers execute. I
> think this will change your priority (ie: priority inversion)?

Any priority inversion here is already possible today due to
mmap_lock, no? Also, I thought the rw_semaphore implementation is
fair, so the writer will eventually get the lock, right? Please
correct me if I'm wrong.

At this point in the series there can't be any readers, as they need
to acquire mmap_lock in read-mode first, while writers already hold
mmap_lock in write-mode at the point of incrementing mmap_changing.

With per-vma locks, the same synchronization that mmap_lock provided
around mmap_changing will be provided by ctx->map_changing_lock.
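
To make that concrete, here is a rough sketch of what an ioctl path
could look like once mmap_lock is replaced (a hypothetical sketch of
the later conversion, not code from this patch):

  down_read(&ctx->map_changing_lock);
  if (atomic_read(&ctx->mmap_changing)) {
          up_read(&ctx->map_changing_lock);
          return -EAGAIN;         /* event pending; userspace retries */
  }
  /* ... take only the dst vma lock and do the operation ... */
  up_read(&ctx->map_changing_lock);

A non-cooperative event then blocks in
down_write(&ctx->map_changing_lock) until every in-flight uffd
operation drops its read lock, which is exactly the ordering that
mmap_lock used to provide.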
>
> You could use the first bit of the atomic_inc as indication of a write.
> So if the mmap_changing is even, then there are no writers. If it
> didn't change and it's even then you know no modification has happened
> (or it overflowed and hit the same number which would be rare, but
> maybe okay?).

This is already achievable, right? If mmap_changing is >0, then we
know there are writers. The problem is that we want writers (like
mremap operations) to block as long as a userfaultfd operation (which
is also a reader of mmap_changing) is going on. Please note that I'm
inferring this from the current implementation.

AFAIU, mmap_changing isn't required for correctness, because all
operations already happen under the right mode of mmap_lock. It's
there to ensure that while a non-cooperative operation is in
progress, if the user has asked to be notified of it, no other
userfaultfd operation takes place until the user gets the event
notification.
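
From userspace the contract then looks roughly like this (a
hypothetical monitor-side sketch; handle_uffd_event() is a made-up
helper, not part of the uffd API):

  /* Retry a UFFDIO_COPY that raced with a non-cooperative event. */
  for (;;) {
          if (ioctl(uffd, UFFDIO_COPY, &copy) != -1)
                  break;
          if (errno != EAGAIN)
                  err(1, "UFFDIO_COPY");
          /* Drain the pending event (e.g. UFFD_EVENT_REMAP) and
           * update the tracked mappings before retrying. */
          handle_uffd_event(uffd);
  }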
>
> > fctx->orig = octx;
> > fctx->new = ctx;
> > list_add_tail(&fctx->list, fcs);
> > @@ -737,7 +740,9 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
> > if (ctx->features & UFFD_FEATURE_EVENT_REMAP) {
> > vm_ctx->ctx = ctx;
> > userfaultfd_ctx_get(ctx);
> > + down_write(&ctx->map_changing_lock);
> > atomic_inc(&ctx->mmap_changing);
> > + up_write(&ctx->map_changing_lock);
> > } else {
> > /* Drop uffd context if remap feature not enabled */
> > vma_start_write(vma);
> > @@ -783,7 +788,9 @@ bool userfaultfd_remove(struct vm_area_struct *vma,
> > return true;
> >
> > userfaultfd_ctx_get(ctx);
> > + down_write(&ctx->map_changing_lock);
> > atomic_inc(&ctx->mmap_changing);
> > + up_write(&ctx->map_changing_lock);
> > mmap_read_unlock(mm);
> >
> > msg_init(&ewq.msg);
> > @@ -825,7 +832,9 @@ int userfaultfd_unmap_prep(struct vm_area_struct *vma, unsigned long start,
> > return -ENOMEM;
> >
> > userfaultfd_ctx_get(ctx);
> > + down_write(&ctx->map_changing_lock);
> > atomic_inc(&ctx->mmap_changing);
> > + up_write(&ctx->map_changing_lock);
> > unmap_ctx->ctx = ctx;
> > unmap_ctx->start = start;
> > unmap_ctx->end = end;
> > @@ -1709,9 +1718,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
> > if (uffdio_copy.mode & UFFDIO_COPY_MODE_WP)
> > flags |= MFILL_ATOMIC_WP;
> > if (mmget_not_zero(ctx->mm)) {
> > - ret = mfill_atomic_copy(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
> > - uffdio_copy.len, &ctx->mmap_changing,
> > - flags);
> > + ret = mfill_atomic_copy(ctx, uffdio_copy.dst, uffdio_copy.src,
> > + uffdio_copy.len, flags);
> > mmput(ctx->mm);
> > } else {
> > return -ESRCH;
> > @@ -1761,9 +1769,8 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> > goto out;
> >
> > if (mmget_not_zero(ctx->mm)) {
> > - ret = mfill_atomic_zeropage(ctx->mm, uffdio_zeropage.range.start,
> > - uffdio_zeropage.range.len,
> > - &ctx->mmap_changing);
> > + ret = mfill_atomic_zeropage(ctx, uffdio_zeropage.range.start,
> > + uffdio_zeropage.range.len);
> > mmput(ctx->mm);
> > } else {
> > return -ESRCH;
> > @@ -1818,9 +1825,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > return -EINVAL;
> >
> > if (mmget_not_zero(ctx->mm)) {
> > - ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> > - uffdio_wp.range.len, mode_wp,
> > - &ctx->mmap_changing);
> > + ret = mwriteprotect_range(ctx, uffdio_wp.range.start,
> > + uffdio_wp.range.len, mode_wp);
> > mmput(ctx->mm);
> > } else {
> > return -ESRCH;
> > @@ -1870,9 +1876,8 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
> > flags |= MFILL_ATOMIC_WP;
> >
> > if (mmget_not_zero(ctx->mm)) {
> > - ret = mfill_atomic_continue(ctx->mm, uffdio_continue.range.start,
> > - uffdio_continue.range.len,
> > - &ctx->mmap_changing, flags);
> > + ret = mfill_atomic_continue(ctx, uffdio_continue.range.start,
> > + uffdio_continue.range.len, flags);
> > mmput(ctx->mm);
> > } else {
> > return -ESRCH;
> > @@ -1925,9 +1930,8 @@ static inline int userfaultfd_poison(struct userfaultfd_ctx *ctx, unsigned long
> > goto out;
> >
> > if (mmget_not_zero(ctx->mm)) {
> > - ret = mfill_atomic_poison(ctx->mm, uffdio_poison.range.start,
> > - uffdio_poison.range.len,
> > - &ctx->mmap_changing, 0);
> > + ret = mfill_atomic_poison(ctx, uffdio_poison.range.start,
> > + uffdio_poison.range.len, 0);
> > mmput(ctx->mm);
> > } else {
> > return -ESRCH;
> > @@ -2003,13 +2007,14 @@ static int userfaultfd_move(struct userfaultfd_ctx *ctx,
> > if (mmget_not_zero(mm)) {
> > mmap_read_lock(mm);
> >
> > - /* Re-check after taking mmap_lock */
> > + /* Re-check after taking map_changing_lock */
> > + down_read(&ctx->map_changing_lock);
> > if (likely(!atomic_read(&ctx->mmap_changing)))
> > ret = move_pages(ctx, mm, uffdio_move.dst, uffdio_move.src,
> > uffdio_move.len, uffdio_move.mode);
> > else
> > ret = -EAGAIN;
> > -
> > + up_read(&ctx->map_changing_lock);
> > mmap_read_unlock(mm);
> > mmput(mm);
> > } else {
> > @@ -2216,6 +2221,7 @@ static int new_userfaultfd(int flags)
> > ctx->flags = flags;
> > ctx->features = 0;
> > ctx->released = false;
> > + init_rwsem(&ctx->map_changing_lock);
> > atomic_set(&ctx->mmap_changing, 0);
> > ctx->mm = current->mm;
> > /* prevent the mm struct to be freed */
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index 691d928ee864..3210c3552976 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -69,6 +69,13 @@ struct userfaultfd_ctx {
> > unsigned int features;
> > /* released */
> > bool released;
> > + /*
> > + * Prevents userfaultfd operations (fill/move/wp) from happening while
> > + * some non-cooperative event(s) is taking place. Increments are done
> > + * in write-mode, whereas userfaultfd operations, which include
> > + * reading mmap_changing, are done in read-mode.
> > + */
> > + struct rw_semaphore map_changing_lock;
> > /* memory mappings are changing because of non-cooperative event */
> > atomic_t mmap_changing;
> > /* mm with one or more vmas attached to this userfaultfd_ctx */
> > @@ -113,22 +120,18 @@ extern int mfill_atomic_install_pte(pmd_t *dst_pmd,
> > unsigned long dst_addr, struct page *page,
> > bool newly_allocated, uffd_flags_t flags);
> >
> > -extern ssize_t mfill_atomic_copy(struct mm_struct *dst_mm, unsigned long dst_start,
> > +extern ssize_t mfill_atomic_copy(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > unsigned long src_start, unsigned long len,
> > - atomic_t *mmap_changing, uffd_flags_t flags);
> > -extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
> > + uffd_flags_t flags);
> > +extern ssize_t mfill_atomic_zeropage(struct userfaultfd_ctx *ctx,
> > unsigned long dst_start,
> > - unsigned long len,
> > - atomic_t *mmap_changing);
> > -extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
> > - unsigned long len, atomic_t *mmap_changing,
> > - uffd_flags_t flags);
> > -extern ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
> > - unsigned long len, atomic_t *mmap_changing,
> > - uffd_flags_t flags);
> > -extern int mwriteprotect_range(struct mm_struct *dst_mm,
> > - unsigned long start, unsigned long len,
> > - bool enable_wp, atomic_t *mmap_changing);
> > + unsigned long len);
> > +extern ssize_t mfill_atomic_continue(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > + unsigned long len, uffd_flags_t flags);
> > +extern ssize_t mfill_atomic_poison(struct userfaultfd_ctx *ctx, unsigned long start,
> > + unsigned long len, uffd_flags_t flags);
> > +extern int mwriteprotect_range(struct userfaultfd_ctx *ctx, unsigned long start,
> > + unsigned long len, bool enable_wp);
> > extern long uffd_wp_range(struct vm_area_struct *vma,
> > unsigned long start, unsigned long len, bool enable_wp);
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index e3a91871462a..6e2ca04ab04d 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -353,11 +353,11 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
> > * called with mmap_lock held, it will release mmap_lock before returning.
> > */
> > static __always_inline ssize_t mfill_atomic_hugetlb(
> > + struct userfaultfd_ctx *ctx,
> > struct vm_area_struct *dst_vma,
> > unsigned long dst_start,
> > unsigned long src_start,
> > unsigned long len,
> > - atomic_t *mmap_changing,
> > uffd_flags_t flags)
> > {
> > struct mm_struct *dst_mm = dst_vma->vm_mm;
> > @@ -379,6 +379,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> > * feature is not supported.
> > */
> > if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
> > + up_read(&ctx->map_changing_lock);
> > mmap_read_unlock(dst_mm);
> > return -EINVAL;
> > }
> > @@ -463,6 +464,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> > cond_resched();
> >
> > if (unlikely(err == -ENOENT)) {
> > + up_read(&ctx->map_changing_lock);
> > mmap_read_unlock(dst_mm);
> > BUG_ON(!folio);
> >
> > @@ -473,12 +475,13 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> > goto out;
> > }
> > mmap_read_lock(dst_mm);
> > + down_read(&ctx->map_changing_lock);
> > /*
> > * If memory mappings are changing because of non-cooperative
> > * operation (e.g. mremap) running in parallel, bail out and
> > * request the user to retry later
> > */
> > - if (mmap_changing && atomic_read(mmap_changing)) {
> > + if (atomic_read(&ctx->mmap_changing)) {
> > err = -EAGAIN;
> > break;
> > }
> > @@ -501,6 +504,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> > }
> >
> > out_unlock:
> > + up_read(&ctx->map_changing_lock);
> > mmap_read_unlock(dst_mm);
> > out:
> > if (folio)
> > @@ -512,11 +516,11 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> > }
> > #else /* !CONFIG_HUGETLB_PAGE */
> > /* fail at build time if gcc attempts to use this */
> > -extern ssize_t mfill_atomic_hugetlb(struct vm_area_struct *dst_vma,
> > +extern ssize_t mfill_atomic_hugetlb(struct userfaultfd_ctx *ctx,
> > + struct vm_area_struct *dst_vma,
> > unsigned long dst_start,
> > unsigned long src_start,
> > unsigned long len,
> > - atomic_t *mmap_changing,
> > uffd_flags_t flags);
> > #endif /* CONFIG_HUGETLB_PAGE */
> >
> > @@ -564,13 +568,13 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
> > return err;
> > }
> >
> > -static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
> > +static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> > unsigned long dst_start,
> > unsigned long src_start,
> > unsigned long len,
> > - atomic_t *mmap_changing,
> > uffd_flags_t flags)
> > {
> > + struct mm_struct *dst_mm = ctx->mm;
> > struct vm_area_struct *dst_vma;
> > ssize_t err;
> > pmd_t *dst_pmd;
> > @@ -600,8 +604,9 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
> > * operation (e.g. mremap) running in parallel, bail out and
> > * request the user to retry later
> > */
> > + down_read(&ctx->map_changing_lock);
> > err = -EAGAIN;
> > - if (mmap_changing && atomic_read(mmap_changing))
> > + if (atomic_read(&ctx->mmap_changing))
> > goto out_unlock;
> >
> > /*
> > @@ -633,8 +638,8 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
> > * If this is a HUGETLB vma, pass off to appropriate routine
> > */
> > if (is_vm_hugetlb_page(dst_vma))
> > - return mfill_atomic_hugetlb(dst_vma, dst_start, src_start,
> > - len, mmap_changing, flags);
> > + return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
> > + src_start, len, flags);
> >
> > if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
> > goto out_unlock;
> > @@ -693,6 +698,7 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
> > if (unlikely(err == -ENOENT)) {
> > void *kaddr;
> >
> > + up_read(&ctx->map_changing_lock);
> > mmap_read_unlock(dst_mm);
> > BUG_ON(!folio);
> >
> > @@ -723,6 +729,7 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
> > }
> >
> > out_unlock:
> > + up_read(&ctx->map_changing_lock);
> > mmap_read_unlock(dst_mm);
> > out:
> > if (folio)
> > @@ -733,34 +740,33 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
> > return copied ? copied : err;
> > }
> >
> > -ssize_t mfill_atomic_copy(struct mm_struct *dst_mm, unsigned long dst_start,
> > +ssize_t mfill_atomic_copy(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > unsigned long src_start, unsigned long len,
> > - atomic_t *mmap_changing, uffd_flags_t flags)
> > + uffd_flags_t flags)
> > {
> > - return mfill_atomic(dst_mm, dst_start, src_start, len, mmap_changing,
> > + return mfill_atomic(ctx, dst_start, src_start, len,
> > uffd_flags_set_mode(flags, MFILL_ATOMIC_COPY));
> > }
> >
> > -ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, unsigned long start,
> > - unsigned long len, atomic_t *mmap_changing)
> > +ssize_t mfill_atomic_zeropage(struct userfaultfd_ctx *ctx,
> > + unsigned long start,
> > + unsigned long len)
> > {
> > - return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
> > + return mfill_atomic(ctx, start, 0, len,
> > uffd_flags_set_mode(0, MFILL_ATOMIC_ZEROPAGE));
> > }
> >
> > -ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
> > - unsigned long len, atomic_t *mmap_changing,
> > - uffd_flags_t flags)
> > +ssize_t mfill_atomic_continue(struct userfaultfd_ctx *ctx, unsigned long start,
> > + unsigned long len, uffd_flags_t flags)
> > {
> > - return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
> > + return mfill_atomic(ctx, start, 0, len,
> > uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE));
> > }
> >
> > -ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
> > - unsigned long len, atomic_t *mmap_changing,
> > - uffd_flags_t flags)
> > +ssize_t mfill_atomic_poison(struct userfaultfd_ctx *ctx, unsigned long start,
> > + unsigned long len, uffd_flags_t flags)
> > {
> > - return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
> > + return mfill_atomic(ctx, start, 0, len,
> > uffd_flags_set_mode(flags, MFILL_ATOMIC_POISON));
> > }
> >
> > @@ -793,10 +799,10 @@ long uffd_wp_range(struct vm_area_struct *dst_vma,
> > return ret;
> > }
> >
> > -int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> > - unsigned long len, bool enable_wp,
> > - atomic_t *mmap_changing)
> > +int mwriteprotect_range(struct userfaultfd_ctx *ctx, unsigned long start,
> > + unsigned long len, bool enable_wp)
> > {
> > + struct mm_struct *dst_mm = ctx->mm;
> > unsigned long end = start + len;
> > unsigned long _start, _end;
> > struct vm_area_struct *dst_vma;
> > @@ -820,8 +826,9 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> > * operation (e.g. mremap) running in parallel, bail out and
> > * request the user to retry later
> > */
> > + down_read(&ctx->map_changing_lock);
> > err = -EAGAIN;
> > - if (mmap_changing && atomic_read(mmap_changing))
> > + if (atomic_read(&ctx->mmap_changing))
> > goto out_unlock;
> >
> > err = -ENOENT;
> > @@ -850,6 +857,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> > err = 0;
> > }
> > out_unlock:
> > + up_read(&ctx->map_changing_lock);
> > mmap_read_unlock(dst_mm);
> > return err;
> > }
> > --
> > 2.43.0.429.g432eaa2c6b-goog
> >
> >