potential new userfaultfd vs khugepaged conflict [was: Re: [PATCH v2 2/3] userfaultfd: UFFDIO_REMAP uABI]

From: Jann Horn
Date: Wed Sep 27 2023 - 06:07:16 EST


[moving Hugh into "To:" recipients as FYI for khugepaged interaction]

On Sat, Sep 23, 2023 at 3:31 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>
> This implements the uABI of UFFDIO_REMAP.
>
> Notably one mode bitflag is also forwarded (and in turn known) by the
> lowlevel remap_pages method.
>
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
[...]
> +/*
> + * The mmap_lock for reading is held by the caller. Just move the page
> + * from src_pmd to dst_pmd if possible, and return true if succeeded
> + * in moving the page.
> + */
> +static int remap_pages_pte(struct mm_struct *dst_mm,
> + struct mm_struct *src_mm,
> + pmd_t *dst_pmd,
> + pmd_t *src_pmd,
> + struct vm_area_struct *dst_vma,
> + struct vm_area_struct *src_vma,
> + unsigned long dst_addr,
> + unsigned long src_addr,
> + __u64 mode)
> +{
> + swp_entry_t entry;
> + pte_t orig_src_pte, orig_dst_pte;
> + spinlock_t *src_ptl, *dst_ptl;
> + pte_t *src_pte = NULL;
> + pte_t *dst_pte = NULL;
> +
> + struct folio *src_folio = NULL;
> + struct anon_vma *src_anon_vma = NULL;
> + struct mmu_notifier_range range;
> + int err = 0;
> +
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_mm,
> + src_addr, src_addr + PAGE_SIZE);
> + mmu_notifier_invalidate_range_start(&range);
> +retry:
> + dst_pte = pte_offset_map_nolock(dst_mm, dst_pmd, dst_addr, &dst_ptl);
> +
> + /* If an huge pmd materialized from under us fail */
> + if (unlikely(!dst_pte)) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + src_pte = pte_offset_map_nolock(src_mm, src_pmd, src_addr, &src_ptl);
> +
> + /*
> + * We held the mmap_lock for reading so MADV_DONTNEED
> + * can zap transparent huge pages under us, or the
> + * transparent huge page fault can establish new
> + * transparent huge pages under us.
> + */
> + if (unlikely(!src_pte)) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + BUG_ON(pmd_none(*dst_pmd));
> + BUG_ON(pmd_none(*src_pmd));
> + BUG_ON(pmd_trans_huge(*dst_pmd));
> + BUG_ON(pmd_trans_huge(*src_pmd));

This works for now, but note that Hugh Dickins has recently been
reworking khugepaged such that PTE-based mappings can be collapsed
into transhuge mappings under the mmap lock held in *read mode*;
holders of the mmap lock in read mode can only synchronize against
this by taking the right page table spinlock and rechecking the pmd
value. This is only the case for file-based mappings so far, not for
anonymous private VMAs; and this code only operates on anonymous
private VMAs so far, so it works out.

But if either Hugh further reworks khugepaged such that anonymous VMAs
can be collapsed under the mmap lock in read mode, or you expand this
code to work on file-backed VMAs, then it will become possible to hit
these BUG_ON() calls. I'm not sure what the plans for khugepaged going
forward are, but the number of edgecases everyone has to keep in mind
would go down if you changed this function to deal gracefully with
page tables disappearing under you.

In the newest version of mm/pgtable-generic.c, above
__pte_offset_map_lock(), there is a big comment block explaining the
current rules for page table access; in particular, regarding the
helper pte_offset_map_nolock() that you're using:

* pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
* but when successful, it also outputs a pointer to the spinlock in ptlp - as
* pte_offset_map_lock() does, but in this case without locking it. This helps
* the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
* act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
* pointer for the page table that it returns. In principle, the caller should
* recheck *pmd once the lock is taken; in practice, no callsite needs that -
* either the mmap_lock for write, or pte_same() check on contents, is enough.

If this becomes hittable in the future, I think you will need to
recheck *pmd, at least for dst_pte, to avoid copying PTEs into a
detached page table.

> + spin_lock(dst_ptl);
> + orig_dst_pte = *dst_pte;
> + spin_unlock(dst_ptl);
> + if (!pte_none(orig_dst_pte)) {
> + err = -EEXIST;
> + goto out;
> + }
> +
> + spin_lock(src_ptl);
> + orig_src_pte = *src_pte;
> + spin_unlock(src_ptl);
> + if (pte_none(orig_src_pte)) {
> + if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES))
> + err = -ENOENT;
> + else /* nothing to do to remap a hole */
> + err = 0;
> + goto out;
> + }
> +
> + if (pte_present(orig_src_pte)) {
> + /*
> + * Pin and lock both source folio and anon_vma. Since we are in
> + * RCU read section, we can't block, so on contention have to
> + * unmap the ptes, obtain the lock and retry.
> + */
> + if (!src_folio) {
> + struct folio *folio;
> +
> + /*
> + * Pin the page while holding the lock to be sure the
> + * page isn't freed under us
> + */
> + spin_lock(src_ptl);
> + if (!pte_same(orig_src_pte, *src_pte)) {
> + spin_unlock(src_ptl);
> + err = -EAGAIN;
> + goto out;
> + }
> +
> + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> + if (!folio || !folio_test_anon(folio) ||
> + folio_test_large(folio) ||
> + folio_estimated_sharers(folio) != 1) {
> + spin_unlock(src_ptl);
> + err = -EBUSY;
> + goto out;
> + }
> +
> + folio_get(folio);
> + src_folio = folio;
> + spin_unlock(src_ptl);
> +
> + /* block all concurrent rmap walks */
> + if (!folio_trylock(src_folio)) {
> + pte_unmap(&orig_src_pte);
> + pte_unmap(&orig_dst_pte);
> + src_pte = dst_pte = NULL;
> + /* now we can block and wait */
> + folio_lock(src_folio);
> + goto retry;
> + }
> + }
> +
> + if (!src_anon_vma) {
> + /*
> + * folio_referenced walks the anon_vma chain
> + * without the folio lock. Serialize against it with
> + * the anon_vma lock, the folio lock is not enough.
> + */
> + src_anon_vma = folio_get_anon_vma(src_folio);
> + if (!src_anon_vma) {
> + /* page was unmapped from under us */
> + err = -EAGAIN;
> + goto out;
> + }
> + if (!anon_vma_trylock_write(src_anon_vma)) {
> + pte_unmap(&orig_src_pte);
> + pte_unmap(&orig_dst_pte);
> + src_pte = dst_pte = NULL;
> + /* now we can block and wait */
> + anon_vma_lock_write(src_anon_vma);
> + goto retry;
> + }
> + }
> +
> + err = remap_anon_pte(dst_mm, src_mm, dst_vma, src_vma,
> + dst_addr, src_addr, dst_pte, src_pte,
> + orig_dst_pte, orig_src_pte,
> + dst_ptl, src_ptl, src_folio);
> + } else {
> + entry = pte_to_swp_entry(orig_src_pte);
> + if (non_swap_entry(entry)) {
> + if (is_migration_entry(entry)) {
> + pte_unmap(&orig_src_pte);
> + pte_unmap(&orig_dst_pte);
> + src_pte = dst_pte = NULL;
> + migration_entry_wait(src_mm, src_pmd,
> + src_addr);
> + err = -EAGAIN;
> + } else
> + err = -EFAULT;
> + goto out;
> + }
> +
> + err = remap_swap_pte(dst_mm, src_mm, dst_addr, src_addr,
> + dst_pte, src_pte,
> + orig_dst_pte, orig_src_pte,
> + dst_ptl, src_ptl);
> + }
> +
> +out:
> + if (src_anon_vma) {
> + anon_vma_unlock_write(src_anon_vma);
> + put_anon_vma(src_anon_vma);
> + }
> + if (src_folio) {
> + folio_unlock(src_folio);
> + folio_put(src_folio);
> + }
> + if (dst_pte)
> + pte_unmap(dst_pte);
> + if (src_pte)
> + pte_unmap(src_pte);
> + mmu_notifier_invalidate_range_end(&range);
> +
> + return err;
> +}