Re: [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA

From: David Hildenbrand
Date: Fri Dec 02 2022 - 12:12:23 EST


On 02.12.22 17:56, David Hildenbrand wrote:
On 02.12.22 17:33, Peter Xu wrote:
On Fri, Dec 02, 2022 at 01:27:48PM +0100, David Hildenbrand wrote:
Currently, we don't enable writenotify when enabling userfaultfd-wp on
a shared writable mapping (for now we only support SHMEM). The consequence

and hugetlbfs

is that vma->vm_page_prot will still include write permissions, to be set
as default for all PTEs that get remapped (e.g., mprotect(), NUMA hinting,
page migration, ...).

The thing is by default I think we want the write bit..

The simple example is (1) register UFFD_WP on shmem writable, (2) write a
page. Here we didn't wr-protect anything, so we want the write bit there.

Or the other example is when UFFDIO_COPY with flags==0 even if with
VM_UFFD_WP. We definitely wants the write bit.

We only doesn't want the write bit when uffd-wp is explicitly set.

I think fundamentally the core is uffd-wp is pte-based, so the information
resides in pte not vma. I'm not strongly objecting this patch, especially
you mentioned auto-numa so I need to have a closer look later there.
However I do think uffd-wp is slightly special because we always need to
consider pte information anyway, so a per-vma information doesn't hugely
help, IMHO.

That's the same as softdirty tracking, IMHO.

[...]

Running the mprotect() reproducer [1] without this commit:
$ ./uffd-wp-mprotect
FAIL: uffd-wp did not fire
Running the mprotect() reproducer with this commit:
$ ./uffd-wp-mprotect
PASS: uffd-wp fired

[1] https://lore.kernel.org/all/222fc0b2-6ec0-98e7-833f-ea868b248446@xxxxxxxxxx/T/#u

I still hope for a formal patch (non-rfc) we can have a reproducer outside
mprotect(). IMHO mprotect() is really ambiguously here being used with
uffd-wp, so not a good example IMO as I explained in the other thread [1].

I took the low hanging fruit to showcase that this is a more generic problem.
The reproducer is IMHO nice because it's simple and race-free.


I'll need to off-work most of the rest of today, but maybe I can also have
a look in the weekend or Monday more on the numa paths. Before that, can
we first reach a consensus that we have the mm/migrate patch there to be
merged first? These are two issues, IMHO.

I know you're against me for some reason, but until now I sincerely don't
know why. That patch sololy recovers write bit status (by removing it for
read-only) for a migration entry and that definitely makes sense to me. As
I also mentioned in the old version of that thread, we can rework migration
entries and merge READ|WRITE entries into a GENERIC entry one day if you
think proper, but that's for later.

I'm not against you. I'm against changing well-working, common code
when it doesn't make any sense to me to change it. And now we have proof that
mprotect() just behaves exactly the same way, using the basic rules of vma->vm_page_prot.

Yes, there is broken sparc64 (below), but that shouldn't dictate our implementation.


What *would* make sense to me, as I raised, is:

diff --git a/mm/migrate.c b/mm/migrate.c
index dff333593a8a..9fc181fd3c5a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -213,8 +213,10 @@ static bool remove_migration_pte(struct folio *folio,
pte = pte_mkdirty(pte);
if (is_writable_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
- else if (pte_swp_uffd_wp(*pvmw.pte))
+ else if (pte_swp_uffd_wp(*pvmw.pte)) {
pte = pte_mkuffd_wp(pte);
+ pt = pte_wrprotect(pte);
+ }
if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
rmap_flags |= RMAP_EXCLUSIVE;


It still requires patch each and every possible code location, which I dislike as
described in the patch description. The fact that there are still uffd-wp bugs
with your patch makes that hopefully clear. I'd be interested if they can be
reproduced witht his patch.


And if NUMA hinting is indeed the problem, without this patch what would
be required would most probably be:


diff --git a/mm/memory.c b/mm/memory.c
index 8a6d5c823f91..869d35ef0e24 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4808,6 +4808,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
pte = pte_mkyoung(pte);
if (was_writable)
pte = pte_mkwrite(pte);
+ if (pte_uffd_wp(pte))
+ pte = pte_wrprotect(pte);
ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);


And just to make my point about the migration path clearer: doing it your way
would be:

diff --git a/mm/memory.c b/mm/memory.c
index 8a6d5c823f91..a7c4c1a57f6a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4808,6 +4808,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
pte = pte_mkyoung(pte);
if (was_writable)
pte = pte_mkwrite(pte);
+ else
+ pte = pte_wrprotect(pte);
ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);


And I don't think that's the right approach.


--
Thanks,

David / dhildenb