Re: [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA

From: David Hildenbrand
Date: Tue Dec 06 2022 - 11:29:10 EST

Next message: Arnaldo Carvalho de Melo: "Re: [PATCH 2/3] perf build: Use libtraceevent from the system"
Previous message: Mark Rutland: "Re: [PATCH v3 7/8] perf: Add perf_event_attr::config3"
In reply to: kernel test robot: "Re: [PATCH] mm/uffd: Always wr-protect pte in pte_mkuffd_wp()"
Next in thread: Hugh Dickins: "Re: [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

I think fundamentally the core is uffd-wp is pte-based, so the information
resides in pte not vma. I'm not strongly objecting this patch, especially
you mentioned auto-numa so I need to have a closer look later there.
However I do think uffd-wp is slightly special because we always need to
consider pte information anyway, so a per-vma information doesn't hugely
help, IMHO.

That's the same as softdirty tracking, IMHO.

Soft-dirty doesn't have a bit in the pte showing whether the page is
protected.

If the page is already softdirty (PTE bit), we don't need write faults and consequently can map the page writable. If the page is not softdirty, we need !write as default. But let's talk in detail about vma->vm_page_prot and interaction with other wrprotect features below.

One wr-protects in soft-dirty with either ALL or NONE. That's per-vma.

One wr-protects in uffd-wp by wr-protect specific page or range of pages.
That's per-page.

Let me try to explain clearer what the issue with uffd-wp on shmem is right now and how to proceed from here. maybe that makes it clearer what the vma->vm_page_prot semantics are.

vma->vm_page_prot defines the default PTE/PMD/... protection. This protection is always *safe*, meaning, if you blindly use that protection when (re)mapping a page, there won't be correctness issues. No need to manually wrprotect afterwards.

That's why migration, mprotect(), NUMA faults that all rely on vma->vm_page_prot work as expected for now.

When calculating vma->vm_page_prot, we considers three things:

(1) VMA protection. For example, no VM_WRITE? then it won't include
write permissions.
(2) Private vs. shared: If the mapping is private, we will never have
write permissions in vma->vm_page_prot, because we might have to
COW individual PTEs. We really want write faults as default.
(3) Write-notify: Some mechanism (softdirty tracking, file system, uffd-
wp) *might* require default write-protection, such that we always
properly trigger a write fault instead where we can handle these.
This only applies to shared mappings, because private mappings
always default to !write (2).

As vma->vm_page_prot is safe, it is always valid to apply them when mapping a PTE/PMD/ ... *in addition* we can use other information, to detect if we still can map the PTE writable (if they were removed due to (2) or (3)).

In migration code, we use the migration type (writable migration entries). In NUMA-hinting code we used the stored savedwrite value. Absence of these bits does not imply that we have to map the page wrprotected.

We never had to remove write permissions so far from the vma->vm_page_prot default. We always only added permissions.

Now, uffd-wp on shmem as of now violates these semantics. vma->vm_page_prot cannot always be used as a safe default, because we might have to wrprotect individual PTEs. Note that for uffd-wp on anonymous memory this was not an issue, because we default to !write in vma->vm_page_prot.

The two possible ways to approach this for uffd-wp on shmem are:

(1) Obey existing vma->vm_page_prot semantics. Default to !write and
optimize the relevant cases to *add* the write bit. This is
essentially what this patch does, minus
can_change_pte_writable() optimizations on relevant code paths where
we'd want to maintain the write bit. For example, when removing
uffd-wp protection we might want to restore the write bit directly.

(2) Default to write permissions and check each and every code location
where we remap for uffd-wp ptes, to manuall wrprotect -- *remove*
the write bit. Alternatively, as you said, always wrprotect when
setting the PTE bit, which might work as well.

My claim is that (1) is less error prone, because in the worst case we forget to optimize one code location -- instead to resulting in a BUG if we forget to wrprotect (what we have now). But I am not going to fight for it, because I can see that (2) can be made to work as well, as you outline in your patch.

You seem to have decided on (2). Fair enough, you know uffd-wp best. We just have to fix it properly and make the logic consistent whenever we remap a page.

Is anything I said fundamentally wrong?

[...]

Running the mprotect() reproducer [1] without this commit:
$ ./uffd-wp-mprotect
FAIL: uffd-wp did not fire
Running the mprotect() reproducer with this commit:
$ ./uffd-wp-mprotect
PASS: uffd-wp fired

[1] https://lore.kernel.org/all/222fc0b2-6ec0-98e7-833f-ea868b248446@xxxxxxxxxx/T/#u

I still hope for a formal patch (non-rfc) we can have a reproducer outside
mprotect(). IMHO mprotect() is really ambiguously here being used with
uffd-wp, so not a good example IMO as I explained in the other thread [1].

I took the low hanging fruit to showcase that this is a more generic problem.
The reproducer is IMHO nice because it's simple and race-free.

If no one is using mprotect() with uffd-wp like that, then the reproducer
may not be valid - the reproducer is defining how it should work, but does
that really stand? That's why I said it's ambiguous, because the
definition in this case is unclear.

There are interesting variations like:

mmap(PROT_READ, MAP_POPULATE|MAP_SHARED)
uffd_wp()
mprotect(PROT_READ|PROT_WRITE)

Where we start out with all-write permissions before we enable selective write permissions.

But I'm not going to argue about whats valid and whats not as long as it's documented ;). I primarily wanted to showcase that the same logic based on vma->vm_page_prot is used elsewhere, and that migration PTE restoration is not particularly special.

I think numa has the problem too which I agree with you. If you attach a
numa reproducer it'll be nicer. But again I'm not convinced uffd-wp is a
per-vma thing, which seems to be what this patch is based upon.

I might be able to give NUMA hinting a shot later this week, but I have other stuff to focus on.

Now I really wonder whether I should just simply wr-protect pte for
pte_mkuffd_wp() always, attached. I didn't do that from the start because
I wanted to keep the helpers operate on one bit only. But I found that
it's actually common technique to use in pgtable arch code, and it really
doesn't make sense to not wr-protect a pte if uffd-wp is set on a present
entry. It's much safer.

Indeed, that would be much safer.

I'll need to off-work most of the rest of today, but maybe I can also have
a look in the weekend or Monday more on the numa paths. Before that, can
we first reach a consensus that we have the mm/migrate patch there to be
merged first? These are two issues, IMHO.

I know you're against me for some reason, but until now I sincerely don't
know why. That patch sololy recovers write bit status (by removing it for
read-only) for a migration entry and that definitely makes sense to me. As
I also mentioned in the old version of that thread, we can rework migration
entries and merge READ|WRITE entries into a GENERIC entry one day if you
think proper, but that's for later.

I'm not against you. I'm against changing well-working, common code
when it doesn't make any sense to me to change it.

This goes back to the original question of whether we should remove the
write bit for read migration entry. Well, let's just focus on others;
we're all tired of this one.

I hope my lengthy explanation above was helpful and correct.

And now we have proof that
mprotect() just behaves exactly the same way, using the basic rules of vma->vm_page_prot.

Yes, there is broken sparc64 (below), but that shouldn't dictate our implementation.

I doubt whether sparc64 is broken if it has been like that anyway, because
I know little on sparc64 so I guess I'd not speak on that.

I'd wish some of the sparc64 maintainers would speak up finally. Anyhow, most probably I'll write a reproducer for some broken case and send it to the sparc64 list. Yeah, I need to get Linux for sparc64 running inside a VM ... :/

[...]

Yes, but now I'm prone to the patch I attached which should just cover all
pte_mkuffd_wp().

We should probably do the same for the pmd variants, and clean up the existing wrprotect like:
pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));

But that's of course, not a fix.

Side note: since looking at the numa code, I found that after the recent
rework of removing savedwrite for numa, cdb205f9e220 ("mm/autonuma: use
can_change_(pte|pmd)_writable() to replace savedwrite"), I think it can
happen that after numa balancing one pte under uffd-wp vma (but not
wr-protected) can have its write bit lost if the migration failed during
recovering, because vma_wants_manual_pte_write_upgrade() will return false
for such case. Is it true?

Yes, you are correct. I added that to the patch description:

"
Note that we don't optimize for the actual migration case:
(1) When migration succeeds the new PTE will not be writable because
the source PTE was not writable (protnone); in the future we
might just optimize that case similarly by reusing
can_change_pte_writable()/can_change_pmd_writable() when
removing migration PTEs.
(2) When migration fails, we'd have to recalculate the "writable"
flag because we temporarily dropped the PT lock; for now keep it
simple and set "writable=false".
"

Case (1) would, with your current patch, always lose the write bit during migration, even if vma->vm_page_prot included it. We most might want to optimize that in the future.

Case (2) is rather a corner case, and unless people complain about it being a real performance issue, it felt cleaner (less code) to not optimize for that now.

Again Peter, I am not against you, not at all. Sorry if I gave you the impression. I highly appreciate your work and this discussion.

--
Thanks,

David / dhildenb

Next message: Arnaldo Carvalho de Melo: "Re: [PATCH 2/3] perf build: Use libtraceevent from the system"
Previous message: Mark Rutland: "Re: [PATCH v3 7/8] perf: Add perf_event_attr::config3"
In reply to: kernel test robot: "Re: [PATCH] mm/uffd: Always wr-protect pte in pte_mkuffd_wp()"
Next in thread: Hugh Dickins: "Re: [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]