Re: [PATCH v2 2/3] userfaultfd: UFFDIO_REMAP uABI

From: David Hildenbrand
Date: Wed Sep 27 2023 - 09:30:37 EST


+static int remap_anon_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma,
+ unsigned long dst_addr, unsigned long src_addr,
+ pte_t *dst_pte, pte_t *src_pte,
+ pte_t orig_dst_pte, pte_t orig_src_pte,
+ spinlock_t *dst_ptl, spinlock_t *src_ptl,
+ struct folio *src_folio)
+{
+ struct anon_vma *dst_anon_vma;
+
+ double_pt_lock(dst_ptl, src_ptl);
+
+ if (!pte_same(*src_pte, orig_src_pte) ||
+ !pte_same(*dst_pte, orig_dst_pte) ||
+ folio_test_large(src_folio) ||
+ folio_estimated_sharers(src_folio) != 1) {

^ here you should check PageAnonExclusive. Please get rid of any explicit/implicit mapcount checks; see the sketch below this hunk.

+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
+
+ BUG_ON(!folio_test_anon(src_folio));
+
+ dst_anon_vma = (void *)dst_vma->anon_vma + PAGE_MAPPING_ANON;
+ WRITE_ONCE(src_folio->mapping,
+ (struct address_space *) dst_anon_vma);

I have some cleanups pending for page_move_anon_rmap() that move the SetPageAnonExclusive hunk out. Here we should be using page_move_anon_rmap() [or rather, folio_move_anon_rmap() after my cleanups].

I'll send them out soonish.
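
Something like this (untested; with the current page_move_anon_rmap() it would take &src_folio->page instead of the folio):

	/* point folio->mapping at the destination VMA's anon_vma */
	folio_move_anon_rmap(src_folio, dst_vma);
	WRITE_ONCE(src_folio->index, linear_page_index(dst_vma, dst_addr));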

+ WRITE_ONCE(src_folio->index, linear_page_index(dst_vma,
+ dst_addr));
+
+ orig_src_pte = ptep_clear_flush(src_vma, src_addr, src_pte);
+ orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
+ orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
+ dst_vma);

I think there's still a theoretical issue here that you could fix by
checking for the AnonExclusive flag, similar to the huge page case.

Consider the following scenario:

1. process P1 does a write fault in a private anonymous VMA, creating
and mapping a new anonymous page A1
2. process P1 forks and creates two children P2 and P3. afterwards, A1
is mapped in P1, P2 and P3 as a COW page, with mapcount 3.
3. process P1 removes its mapping of A1, dropping its mapcount to 2.
4. process P2 uses vmsplice() to grab a reference to A1 with get_user_pages()
5. process P2 removes its mapping of A1, dropping its mapcount to 1.

If at this point P3 does a write fault on its mapping of A1, it will
still trigger copy-on-write thanks to the AnonExclusive mechanism; and
this is necessary to avoid P3 mapping A1 as writable and writing data
into it that will become visible to P2, if P2 and P3 are in different
security contexts.

But if P3 instead moves its mapping of A1 to another address with
remap_anon_pte(), which only does a page mapcount check, the
maybe_mkwrite() will directly make the mapping writable, circumventing
the AnonExclusive mechanism.


Yes, can_change_pte_writable() contains the exact logic for when we can turn something writable even if it wasn't writable before, which includes that PageAnonExclusive is set (but with uffd-wp or softdirty tracking, there is more to consider).
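
A rough, untested sketch of how that could be wired up here (checking VM_WRITE first, because can_change_pte_writable() expects a writable VMA):

	orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
	orig_dst_pte = pte_mkdirty(orig_dst_pte);
	/* only map writable when the write-fault path would do the same */
	if ((dst_vma->vm_flags & VM_WRITE) &&
	    can_change_pte_writable(dst_vma, dst_addr, orig_dst_pte))
		orig_dst_pte = maybe_mkwrite(orig_dst_pte, dst_vma);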

--
Cheers,

David / dhildenb