Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

From: Andrea Arcangeli
Date: Tue Dec 22 2020 - 14:35:04 EST


On Mon, Dec 21, 2020 at 07:19:35PM -0800, Andy Lutomirski wrote:
> instance of this (with mmap_sem held for write) in x86:
> mark_screen_rdonly(). Dare I ask how broken this is? We could likely

That one is buggy despite it takes the mmap_write_lock... inverting
the last two lines would fix it though.

- mmap_write_unlock(mm);
flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, PAGE_SHIFT, false);
+ mmap_write_unlock(mm);

The issue here is that rightfully wp_page_copy copies outside the PT
lock (good thing) and it correctly assumes during the copy nobody can
modify the page if the fault happened in a write access and the pte
had no _PAGE_WRITE bit set.

The bug is clearly in userfaultfd_writeprotect that doesn't enforce
the above invariant.

If the userfaultfd_wrprotect(mode_wp = false) run after the deferred
TLB flush this could never happen because the uffd_wp bit was still
set and the page fault would hit the handle_userfault dead end and
it'd have to be restarted from scratch.

Leaving stale tlb entries after dropping the PT lock and with only the
mmap_lock_read is only ok if the page fault has a "check" that catches
that specific case, so that specific case is fully serialized by the
PT lock alone and the "catch" path is fully aware about the stale TLB
entries that were left around.

So looking at the two cases that predates uffd-wp that already did a
RW->RO transition with the mmap_lock_read, they both comply with the
wp_page_copy invariant but with two different techniques:

1) change_protection is called by the scheduler with the
mmap_read_lock and with a deferred TLB flush, and things don't fall
apart completely there simply because we catch that in do_numa_page:

if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+ /* catch it: no risk to end up in wp_page_copy even if the change_protection
+ is running concurrently and the tlb flush is still pending */
return do_numa_page(vmf);

do_numa_page doesn't issue any further tlb flush, but basically
restores the pte to the previous value, so then the pending flush
becomes a noop, since there was no modification to the pte in the
first place after do_numa_page runs.

2) ksm write_protect_page is also called with mmap_read_lock, and it
will simply not defer the flush. If the tlb flush is done inside
the PT lock it is ok to take mmap_read_lock since the page fault
code will not make any assumption about the TLB before reading the
pagetable content under PT lock.

In fact if mm_tlb_flush_pending(mm) is true, write_protect_page in
KSM has to issue an extra synchronous flush before dropping the PT
lock, even if it finds the _PAGE_WRITE bit is already clear,
precisely because there can be another deferred flush that cleared
the pte but didn't flush the TLB yet.

userfaultfd_writeprotect() is already using the technique in 1) to be
safe, except the un-wrprotect can break it if run while the tlb flush
is still pending...

The good thing is this guarantee must be provided not more granular
even than a vma, not per-mm and a vma cannot be registered on two
different uffd ctx at the same time, so the locking can happen all
internal to the uffd ctx.

My previous suggestion to use a mutex to serialize
userfaultfd_writeprotect with a mutex will still work, but we can run
as many wrprotect and un-wrprotect as we want in parallel, as long as
they're not simultaneous, we can do much better than a mutex.

Ideally we would need a new two_group_semaphore, where each group can
run as many parallel instances as it wants, but no instance of one
group can run in parallel with any instance of the other group. AFIK
such a kind of lock doesn't exist right now.

wrprotect if left alone never had an issue since we had a working
"catch" check for it well before wp_page_copy could run. unprotect
also if left alone was ok since it's a permission upgrade.

Alternatively, we can solve this with the same technique as in 2),
because all it matters is that the 4k pte modification doesn't take
the mmap_lock_write. A modification to a single 4k pte or a single 2m
hugepmd is very likely indication that there's going to be a flood of
those in parallel from all CPUs and likely there's also a flood of
page faults from all CPUs in parallel. In addition for a 4k wrprotect
or un-wrprotect there's not a significant benefit in deferring the TLB
flush.

I don't think we can take the mmap_write_lock unconditionally because
one of the main reasons to deal with the extra complexity of resolving
page faults in userland is to bypass mprotect entirely.

Example, JITs can use unprivileged uffd to eliminate the software-set
dirty bit every time it modifies memory, that would then require
frequent wrprotect/un-wrprotect during page faults and the less likely
we have to block for I/O during those CPU-only events, the
better. This is one of the potential use cases, but there's plenty
more.

So the alternative would be to take mmap_read_lock for "len <=
HPAGE_PMD_SIZE" and the mmap_write_lock otherwise, and to add a
parameter in change_protection to tell it to flush before dropping the
PT lock and not defer the flush. So this is going to be more intrusive
in the VM and it's overall unnecessary.

The below is mostly untested... but it'd be good to hear some feedback
before doing more work in this direction.