Re: [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA

From: David Hildenbrand
Date: Wed Dec 07 2022 - 14:54:38 EST



On upstream during the next write fault, we'll end up in do_numa_page() and
simply remap the page writable due to vm_page_prot, not triggering a write
fault. I can see the "numa_hint_faults" counter in /proc/vmstat increasing
accordingly, so we're really in do_numa_page().

Seems true. I think fundamentally it's because numa hint rely on PROT_NONE
as the hint, and it explicitly checks against mprotect(PROT_NONE) using the
accessible check:

if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);

I'm not sure whether we should also add a pte_uffd_wp(vmf->orig_pte) here
to mask out the uffd-wp cases.

:/ more special UFFD-wp casing, I'm not sure sure about that.

Most importantly, once someone unlocks NUMA hinting for shmem (e.g., MPOL_MF_LAZY, MPOL_F_NUMA_BALANCING) this might be problematic. That at least makes it sound fragile to me.


So far it seems the outcome is not extremely bad - PROT_WRITE only mappings
are rare in real life, and also with the protnone recovery code (and along
with the vm_page_prot patch coming) we'll be able to still recover the pte
into a uffd-wp-ed pte without PROTNONE bit set. But I don't have a solid
clue yet on what's the best.


Yes, just another way to trigger surprise uffd-wp behavior (at least surprising for me ;) ). But this time, not involving mprotect(). I suspect there are more cases, but I might be wrong.

I was primarily trying to find out which other cases might be affected.

[..]



Independent of uffd-wp on shmem, we seem to be missing propagating the
uffd-wp bit when splitting the huge zeropage. So uffd-wp'ing the huge
zeropage and then splitting it loses the uffd-wp markers. :/

For this one, thanks for the reproducer. I'm not extremely sure whether
it's a bug.

Firstly, I think your reproducer should just work well with shmem, afaiu,
because shmem is based on pte markers and it should only work on pte level
(not pmd). The huge zero pmd should got split right after wr-protected.
So the reproducer shouldn't go wrong on shmem at all with/without any
recent fix. Let me know otherwise.

shmem doesn't use the huge shared zeropage, so it should be fine.


For anon, I'm not sure it's a bug, because there's a semantic difference on
anon/shmem. The thing is losing wr-protect on the zero page is the same as
losing wr-protect on a page that is not mapped. For anon currently we
can't track a page that is not mapped and we skip those ranges (being zero
when read). So fundamentally I am not sure whether it'll be an issue for
existing anon uffd-wp users because if it is then it's more than zero
pages.

I think it's a bug, although most probably a low priority one.

Once user space successfully placed an uffd-wp marker, and e.g., verified using pagemap that it is indeed placed, the system should not silently drop it.

The behavior between an ordinary THP and a huge zeropage differs. For THP, we handle the split correctly and don't lose the marker. Assuming the huge zeropage woud be disabled, the behavior would be (IMHO) correct. The test case would pass.

For example, QEMU with uffd-wp based snapshotting will make sure that all virtual addresses are populated (e.g., mapping the shared, eventually the huge zeropage -- populate_read_range()), before protecting using uffd-wp. Losing a uffd-wp marker would be problematic.

The good news is that we barely will end up PTE-mapping the huge zeropage unless there is real user-space interaction (mprotect(), mremap(), mmap()), so this shouldn't trigger in the QEMU use-case.


Anyhow, I'll send a patch in a couple of days and we can discuss further. It's independent of the other discussion, just wanted to report my findings after staring at that code for way too long today.

--
Thanks,

David / dhildenb