Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

From: Andrea Arcangeli
Date: Tue Dec 22 2020 - 16:19:32 EST


On Tue, Dec 22, 2020 at 12:19:49PM -0800, Nadav Amit wrote:
> Perhaps any change to PTE in a page-table should increase a page-table
> generation that we would save in the page-table page-struct (next to the

The current rule is that by the time in the page fault we find a
pte/hugepmd in certain state under the "PT lock", the TLB cannot be
left stale with a more permissive state at that point.

So if under "PT lock" the page fault sees !pte_write, it means no
other CPU can possibly modify the page anymore.

That must be enforced at all times, no sequence number is required if
you enforce that and the whole point of the fix is to enforce that.

This is how things always worked in the page fault and it's perfectly
fine.

For the record I never suggested to change anything in the page fault
code.

So the only way we can leave stale TLB after dropping PT lock with the
mmap_read_lock and concurrent page faults is with a marker:

1) take PT lock
2) leave in the pte/hugepmd a unique identifiable marker
3) change the pte to downgrade permissions
4) drop PT lock and leave stale TLB entries with the upgraded permissions

In point 3) the pte is built in a way that is guaranteed to trigger a
deterministic path in the page fault. And in the way of that
deterministic path you put the marker check for 2) to send the fault
to a dead-end where the stale TLB is actually irrelevant and harmless.

No change to the page fault is needed if you only enforce the above.

Even if we'd take the mmap_write_lock in userfaultfd_writeprotect, you
will still have to deal with the above technique because of
change_prot_numa().

Would you suggest to add a sequence counter and to change all pte_same
in order to make change_prot_numa safer or is it safe enough already
using the above technique that checks the marker in the page fault?

To be safe NUMA balancing is using the same mechanism above of sending
the page fault into a dead end, in order to call the very same
function (change_permissions) with the very same lock (mmap_read_lock)
as userfaultfd_writeprotect.

What happens then in do_numa_page then is different than what happens
in handle_userfault, but both are a dead ends as far as the page fault
code is concerned and they will never come back to the page fault
code. That's how they avoid breaking the page fault.

The final rule for the above logic to work safe for uffd, is that the
marker cannot be cleared until after the deferred TLB flush is
executed (that's where the window for the race existed, and the fix
closed such window).

do_numa_page differs in clearing the marker while the TLB flush still
pending, but it can do that because it puts the same exact pte value
(with the upgraded permissions) that was there before it put the
marker in the pte. In turn do_numa_page makes the pending TLB flush
becomes a noop and it doesn't need to wait for it before removing the
marker. Does that last difference in how the marker is cleared,
warrant to consider what NUMA balancing radically different so to
forbid userfaultfd_writeprotect to use the same logic in the very same
function with the very same lock in order to decrease the VM
complexity? In my view no.

Thanks,
Andrea