Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

From: Yu Zhao
Date: Sun Dec 20 2020 - 04:55:36 EST


On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
> > On Dec 19, 2020, at 10:05 PM, Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> > On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
> >> [ cc’ing some more people who have experience with similar problems ]
> >>
> >>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> >>>
> >>> Hello,
> >>>
> >>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>>> Analyzing this problem indicates that there is a real bug since
> >>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >>>
> >>> Never having to take the mmap_sem for writing, and in turn never
> >>> blocking, in order to modify the pagetables is quite an important
> >>> feature in uffd that justifies uffd instead of mprotect. It's not the
> >>> most important reason to use uffd, but it'd be nice if that guarantee
> >>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> >>> other pgtable manipulations.
> >>>
> >>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>>>
> >>>> cpu0 cpu1
> >>>> ---- ----
> >>>> userfaultfd_writeprotect()
> >>>> [ write-protecting ]
> >>>> mwriteprotect_range()
> >>>> mmap_read_lock()
> >>>> change_protection()
> >>>> change_protection_range()
> >>>> ...
> >>>> change_pte_range()
> >>>> [ defer TLB flushes]
> >>>> userfaultfd_writeprotect()
> >>>> mmap_read_lock()
> >>>> change_protection()
> >>>> [ write-unprotect ]
> >>>> ...
> >>>> [ unprotect PTE logically ]
> >>>> ...
> >>>> [ page-fault]
> >>>> ...
> >>>> wp_page_copy()
> >>>> [ set new writable page in PTE]
> >
> > I don't see any problem in this example -- wp_page_copy() calls
> > ptep_clear_flush_notify(), which should take care of the stale entry
> > left by cpu0.
> >
> > That being said, I suspect the memory corruption you observed is
> > related this example, with cpu1 running something else that flushes
> > conditionally depending on pte_write().
> >
> > Do you know which type of pages were corrupted? file, anon, etc.
>
> First, Yu, you are correct. My analysis is incorrect, but let me have
> another try (below). To answer your (and Andrea’s) question - this happens
> with upstream without any changes, excluding a small fix to the selftest,
> since it failed (got stuck) due to missing wake events. [1]
>
> We are talking about anon memory.
>
> So to correct myself, I think that what I really encountered was actually
> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The
> problem was that in this case the “write”-bit was removed during unprotect.

Thanks. You are right about when the problem happens: UFD write-
UNprotecting. But it's not UFD write-UNprotecting that removes the
writable bit -- the bit can only be removed during COW or UFD
write-protecting. So your original example was almost correct, except
the last line describing cpu1.

The problem is how do_wp_page() handles non-COW pages. (For COW pages,
do_wp_page() works correctly by either reusing an existing page or
make a new copy out of it.) In UFD case, the existing page may not
have been properly write-protected. As you pointed out, the tlb flush
may not be done yet. Making a copy can potentially race with the
writer on cpu2.

Should we fix the problem by ensuring integrity of the copy? IMO, no,
because do_wp_page() shouldn't copy at all in this case. It seems it
was recently broken by

be068f29034f mm: fix misplaced unlock_page in do_wp_page()
09854ba94c6a mm: do_wp_page() simplification

I haven't study them carefully. But if you could just revert them and
run the test again, we'd know where exactly to look at next.

> Sorry for the strange formatting to fit within 80 columns:
>
>
> [ Start: PTE is writable ]
>
> cpu0 cpu1 cpu2
> ---- ---- ----
> [ Writable PTE
> cached in TLB ]
> userfaultfd_writeprotect()
> [ write-*unprotect* ]
> mwriteprotect_range()
> mmap_read_lock()
> change_protection()
>
> change_protection_range()
> ...
> change_pte_range()
> [ *clear* “write”-bit ]
> [ defer TLB flushes]
> [ page-fault ]
>
> wp_page_copy()
> cow_user_page()
> [ copy page ]
> [ write to old
> page ]
>
> set_pte_at_notify()
>
> [ End: cpu2 write not copied form old to new page. ]
>
>
> So this was actually resolved by the second part of the patch - changing
> preserve_write in change_pte_range(). I removed the acquisition of mmap_lock
> for write, left the change in change_pte_range() and the test passes.
>
> Let me give some more thought on whether a mmap_lock is needed
> for write. I need to rehash this TLB flushing algorithm.
>
> Thanks,
> Nadav
>
> [1] https://lore.kernel.org/patchwork/patch/1346386