Re: [PATCH v2] mm: Fix memory ordering for mm_lock_seq and vm_lock_seq

From: Suren Baghdasaryan
Date: Fri Jul 21 2023 - 19:26:47 EST


On Fri, Jul 21, 2023 at 3:51 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> mm->mm_lock_seq effectively functions as a read/write lock; therefore it
> must be used with acquire/release semantics.
>
> A specific example is the interaction between userfaultfd_register() and
> lock_vma_under_rcu().
> userfaultfd_register() does the following from the point where it changes
> a VMA's flags to the point where concurrent readers are permitted again
> (in a simple scenario where only a single private VMA is accessed and no
> merging/splitting is involved):
>
> userfaultfd_register
>   userfaultfd_set_vm_flags
>     vm_flags_reset
>       vma_start_write
>         down_write(&vma->vm_lock->lock)
>         vma->vm_lock_seq = mm_lock_seq [marks VMA as busy]
>         up_write(&vma->vm_lock->lock)
>       vm_flags_init
>         [sets VM_UFFD_* in __vm_flags]
>   vma->vm_userfaultfd_ctx.ctx = ctx
>   mmap_write_unlock
>     vma_end_write_all
>       WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1) [unlocks VMA]
>
> There are no memory barriers in between the __vm_flags update and the
> mm->mm_lock_seq update that unlocks the VMA, so the unlock can be reordered
> to above the `vm_flags_init()` call, which means from the perspective of a
> concurrent reader, a VMA can be marked as a userfaultfd VMA while it is not
> VMA-locked. That's bad; we definitely need a store-release for the unlock
> operation.
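
As an illustration of the pairing (a minimal userspace C11 analogue, not
kernel code; the names below are made up for the sketch and pthreads stand in
for the kernel primitives): the plain flags store has to be ordered before
the release store that publishes the new sequence number, just like
vm_flags_init() has to be ordered before the mm_lock_seq bump in
vma_end_write_all().

/* Userspace sketch only. Build: cc -std=c11 -pthread */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static unsigned long flags_field;	/* plain data, like vma->__vm_flags */
static atomic_int lock_seq;		/* analogue of mm->mm_lock_seq */

static void *writer(void *arg)
{
	(void)arg;
	flags_field |= 0x1;	/* plain store while "locked", like vm_flags_init() */
	/*
	 * RELEASE: the flags store above cannot be reordered past this
	 * "unlock", mirroring smp_store_release() in vma_end_write_all().
	 */
	atomic_store_explicit(&lock_seq, 1, memory_order_release);
	return NULL;
}

static void *reader(void *arg)
{
	(void)arg;
	/* ACQUIRE pairs with the writer's release store. */
	if (atomic_load_explicit(&lock_seq, memory_order_acquire) != 0)
		printf("unlocked, flags=%lx (0x1 is guaranteed visible)\n",
		       flags_field);
	else
		printf("still locked, not touching flags\n");
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}

Without the release/acquire pair, the reader could observe lock_seq == 1
while still seeing the old flags_field, which is exactly the "userfaultfd
VMA while not VMA-locked" situation described above.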
>
> The non-atomic write to vma->vm_lock_seq in vma_start_write() is mostly
> fine because all accesses to vma->vm_lock_seq that matter are always
> protected by the VMA lock. There is a racy read in vma_start_read() though
> that can tolerate false-positives, so we should be using WRITE_ONCE() to
> keep things tidy and data-race-free (including for KCSAN).
>
> On the other side, lock_vma_under_rcu() works as follows in the relevant
> region for locking and userfaultfd check:
>
> lock_vma_under_rcu
>   vma_start_read
>     vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [early bailout]
>     down_read_trylock(&vma->vm_lock->lock)
>     vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [main check]
>   userfaultfd_armed
>     checks vma->vm_flags & __VM_UFFD_FLAGS
>
> Here, the interesting aspect is how far down the mm->mm_lock_seq read
> can be reordered - if this read is reordered down below the vma->vm_flags
> access, this could cause lock_vma_under_rcu() to partly operate on
> information that was read while the VMA was supposed to be locked.
> To prevent this kind of downwards bleeding of the mm->mm_lock_seq read, we
> need to read it with a load-acquire.
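
For the reader side, a rough userspace C11 sketch of the vma_start_read()
ordering (an analogue only; the names are invented and a pthread rwlock
stands in for vma->vm_lock->lock). The early check can be a relaxed load
because its result is only used to bail out, but the check done after taking
the lock needs ACQUIRE so that a later vm_flags read can't be satisfied with
data from before the unlock.

/* Userspace sketch only, not kernel code. Build: cc -std=c11 -pthread */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct fake_mm {
	atomic_int lock_seq;			/* analogue of mm->mm_lock_seq */
};

struct fake_vma {
	struct fake_mm *mm;
	int lock_seq;				/* analogue of vma->vm_lock_seq */
	pthread_rwlock_t lock;			/* analogue of vma->vm_lock->lock */
	unsigned long flags;			/* analogue of vma->vm_flags */
};

/* Returns true with the read lock held, false on bail-out. */
static bool fake_vma_start_read(struct fake_vma *vma)
{
	/* Early bail-out: relaxed is enough, the result is only a hint. */
	if (vma->lock_seq ==
	    atomic_load_explicit(&vma->mm->lock_seq, memory_order_relaxed))
		return false;

	if (pthread_rwlock_tryrdlock(&vma->lock) != 0)
		return false;

	/*
	 * Main check: ACQUIRE pairs with the release in the writer's unlock,
	 * so a subsequent read of vma->flags sees everything the writer
	 * published before bumping the sequence number.
	 */
	if (vma->lock_seq ==
	    atomic_load_explicit(&vma->mm->lock_seq, memory_order_acquire)) {
		pthread_rwlock_unlock(&vma->lock);
		return false;
	}
	return true;	/* caller may now inspect vma->flags, then unlock */
}

int main(void)
{
	struct fake_mm mm;
	struct fake_vma vma = { .mm = &mm };

	atomic_init(&mm.lock_seq, 1);	/* pretend one unlock already happened */
	pthread_rwlock_init(&vma.lock, NULL);
	if (fake_vma_start_read(&vma))
		pthread_rwlock_unlock(&vma.lock);
	pthread_rwlock_destroy(&vma.lock);
	return 0;
}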
>
> Some of the comment wording is based on suggestions by Suren.
>
> BACKPORT WARNING: One of the functions changed by this patch (which I've
> written against Linus' tree) is vma_try_start_write(), but this function
> no longer exists in mm/mm-everything. I don't know whether the merged
> version of this patch will be ordered before or after the patch that
> removes vma_try_start_write(). If you're backporting this patch to a
> tree with vma_try_start_write(), make sure this patch changes that
> function.
>
> Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
> Cc: stable@xxxxxxxxxxxxxxx
> Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>

Thanks for fixing the ordering and making the rules clear! I
completely missed the reordering issue during vma unlocking.

Reviewed-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>

> ---
>
> Notes:
> v2: made the comments much clearer based on off-list input from Suren
>
>  include/linux/mm.h        | 29 +++++++++++++++++++++++------
>  include/linux/mm_types.h  | 28 ++++++++++++++++++++++++++++
>  include/linux/mmap_lock.h | 10 ++++++++--
>  3 files changed, 59 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2dd73e4f3d8e..406ab9ea818f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -641,8 +641,14 @@ static inline void vma_numab_state_free(struct vm_area_struct *vma) {}
>   */
>  static inline bool vma_start_read(struct vm_area_struct *vma)
>  {
> -	/* Check before locking. A race might cause false locked result. */
> -	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> +	/*
> +	 * Check before locking. A race might cause false locked result.
> +	 * We can use READ_ONCE() for the mm_lock_seq here, and don't need
> +	 * ACQUIRE semantics, because this is just a lockless check whose result
> +	 * we don't rely on for anything - the mm_lock_seq read against which we
> +	 * need ordering is below.
> +	 */
> +	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
>  		return false;
>
>  	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> @@ -653,8 +659,13 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
>  	 * False unlocked result is impossible because we modify and check
>  	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
>  	 * modification invalidates all existing locks.
> +	 *
> +	 * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
> +	 * racing with vma_end_write_all(), we only start reading from the VMA
> +	 * after it has been unlocked.
> +	 * This pairs with RELEASE semantics in vma_end_write_all().
>  	 */
> -	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> +	if (unlikely(vma->vm_lock_seq == smp_load_acquire(&vma->vm_mm->mm_lock_seq))) {
>  		up_read(&vma->vm_lock->lock);
>  		return false;
>  	}
> @@ -676,7 +687,7 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, int *mm_lock_seq)
>  	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
>  	 * mm->mm_lock_seq can't be concurrently modified.
>  	 */
> -	*mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> +	*mm_lock_seq = vma->vm_mm->mm_lock_seq;
>  	return (vma->vm_lock_seq == *mm_lock_seq);
>  }
>
> @@ -688,7 +699,13 @@ static inline void vma_start_write(struct vm_area_struct *vma)
>  		return;
>
>  	down_write(&vma->vm_lock->lock);
> -	vma->vm_lock_seq = mm_lock_seq;
> +	/*
> +	 * We should use WRITE_ONCE() here because we can have concurrent reads
> +	 * from the early lockless pessimistic check in vma_start_read().
> +	 * We don't really care about the correctness of that early check, but
> +	 * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> +	 */
> +	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
>  	up_write(&vma->vm_lock->lock);
>  }
>
> @@ -702,7 +719,7 @@ static inline bool vma_try_start_write(struct vm_area_struct *vma)
>  	if (!down_write_trylock(&vma->vm_lock->lock))
>  		return false;
>
> -	vma->vm_lock_seq = mm_lock_seq;
> +	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
>  	up_write(&vma->vm_lock->lock);
>  	return true;
>  }
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index de10fc797c8e..5e74ce4a28cd 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -514,6 +514,20 @@ struct vm_area_struct {
>  	};
>
>  #ifdef CONFIG_PER_VMA_LOCK
> +	/*
> +	 * Can only be written (using WRITE_ONCE()) while holding both:
> +	 *  - mmap_lock (in write mode)
> +	 *  - vm_lock->lock (in write mode)
> +	 * Can be read reliably while holding one of:
> +	 *  - mmap_lock (in read or write mode)
> +	 *  - vm_lock->lock (in read or write mode)
> +	 * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
> +	 * while holding nothing (except RCU to keep the VMA struct allocated).
> +	 *
> +	 * This sequence counter is explicitly allowed to overflow; sequence
> +	 * counter reuse can only lead to occasional unnecessary use of the
> +	 * slowpath.
> +	 */
>  	int vm_lock_seq;
>  	struct vma_lock *vm_lock;
>
> @@ -679,6 +693,20 @@ struct mm_struct {
>  					  * by mmlist_lock
>  					  */
>  #ifdef CONFIG_PER_VMA_LOCK
> +		/*
> +		 * This field has lock-like semantics, meaning it is sometimes
> +		 * accessed with ACQUIRE/RELEASE semantics.
> +		 * Roughly speaking, incrementing the sequence number is
> +		 * equivalent to releasing locks on VMAs; reading the sequence
> +		 * number can be part of taking a read lock on a VMA.
> +		 *
> +		 * Can be modified under write mmap_lock using RELEASE
> +		 * semantics.
> +		 * Can be read with no other protection when holding write
> +		 * mmap_lock.
> +		 * Can be read with ACQUIRE semantics if not holding write
> +		 * mmap_lock.
> +		 */
>  		int mm_lock_seq;
>  #endif
>
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index aab8f1b28d26..e05e167dbd16 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -76,8 +76,14 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
>  static inline void vma_end_write_all(struct mm_struct *mm)
>  {
>  	mmap_assert_write_locked(mm);
> -	/* No races during update due to exclusive mmap_lock being held */
> -	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +	/*
> +	 * Nobody can concurrently modify mm->mm_lock_seq due to exclusive
> +	 * mmap_lock being held.
> +	 * We need RELEASE semantics here to ensure that preceding stores into
> +	 * the VMA take effect before we unlock it with this store.
> +	 * Pairs with ACQUIRE semantics in vma_start_read().
> +	 */
> +	smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
>  }
>  #else
>  static inline void vma_end_write_all(struct mm_struct *mm) {}
>
> base-commit: d192f5382581d972c4ae1b4d72e0b59b34cadeb9
> --
> 2.41.0.487.g6d72f3e995-goog
>