Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table

From: David Hildenbrand
Date: Thu Sep 29 2022 - 15:00:30 EST


On 29.09.22 20:57, Chih-En Lin wrote:
On Thu, Sep 29, 2022 at 08:38:52PM +0200, David Hildenbrand wrote:
On 29.09.22 20:29, Chih-En Lin wrote:
On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
IMHO, a relaxed form that focuses on only the memory consumption reduction
could *possibly* be accepted upstream if it's not too invasive or complex.
During fork(), we'd do exactly what we used to do to PTEs (increment
mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
duplicate swap entries; all while holding the page table lock), however,
sharing the prepared page table with the child process using COW after we
prepared it.

Any (most once we want to *optimize* rmap handling) modification attempts
require breaking COW -- copying the page table for the faulting process. But
at that point, the PTEs are already write-protected and properly accounted
(refcount/mapcount/PageAnonExclusive).

Doing it that way might not require any questionable GUP hacks and swapping,
MMU notifiers etc. "might just work as expected" because the accounting
remains unchanged" -- we simply de-duplicate the page table itself we'd have
after fork and any modification attempts simply replace the mapped copy.

Agree.
However for GUP hacks, if we want to do the COW to page table, we still
need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
check whether the PTE table is available or not before we do the COW to
the table). Otherwise, it will be more complicated since it might need
to handle situations like while preparing the COW work, it just figuring
out that it needs to duplicate the whole table and roll back (recover
the state and copy it to new table). Hopefully, I'm not wrong here.

The nice thing is that GUP itself *usually* doesn't modify page tables. One
corner case is follow_pfn_pte(). All other modifications should happen in
the actual fault handler that has to deal with such kind of unsharing either
way when modifying the PTE.

If the pages are already in a COW-ed pagetable in the desired "shared" state
(e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
pages will just work as expected and we shouldn't be surprised by another
set of GUP+COW CVEs.

We'd really only deduplicate the page table and not play other tricks with
the actual page table content that differ from the existing way of handling
fork().

I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
not modifying the page table. I think we only need "we have to unshare this
page table now" in follow_pfn_pte() and inside the fault handling when GUP
triggers a fault.

I hope my assumption is correct, or am I missing something?


My consideration is when we pinned the page and did the COW to make the
page table be shared. It might not allow mapping the pinned page to R/O)
into both processes.

So, if the fork is working on the shared state, it needs to recover the
table and copy to a new one since that pinned page will need to copy
immediately. We can hold the shared state after occurring such a
situation. So we still need some trick to let the fork() know which page
table already has the pinned page (or such page won't let us share)
before going to duplicate.

Am I wrong here?

I think you might be overthinking this. Let's keep it simple:

1) Handle pinned anon pages just as I described below, falling back to the
"slow" path of page table copying.

2) Once we passed that stage, you can be sure that the COW-ed page table
cannot have actually pinned anon pages. All anon pages in such a page table
have PageAnonExclusive cleared and are "maybe shared". GUP cannot succeed in
pinning these pages anymore, because it will only pin exclusive anon pages!

3) If anybody wants to take a R/O pin on a shared anon page that is mapped
into a COW-ed page table, we trigger a fault with FAULT_FLAG_UNSHARE instead
of pinning the page. This has to break COW on the page table and properly
map an exclusive anon page into it, breaking COW.

Do you see a problem with that?


After that, since we handled the accounting in fork(), we don't need
ownership (pmd_t pointer) anymore. We have to find another way to mark
the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is
stored at that space.)


But devil is in the detail (page table lock, TLB flushing).

Sure, it might be an overhead in the page fault and needs to be handled
carefully. ;)

"will make fork() even have more overhead" is not a good excuse for such
complexity/hacks -- sure, it will make your benchmark results look better in
comparison ;)

;);)
I think that, even if we do the accounting with the COW page table, it
still has a little bit improve.

:)

My gut feeling is that this is true. While we have to do a pass over the
parent page table during fork and wrprotect all PTEs etc., we don't have to
duplicate the page table content and allocate/free memory for that.

One interesting case is when we cannot share an anon page with the child
process because it maybe pinned -- and we have to copy it via
copy_present_page(). In that case, the page table between the parent and the
child would differ and we'd not be able to share the page table.

That is what I want to say above.
The case might happen in the middle of the shared page table progress.
It might cost more overhead to recover it. Therefore, if GUP wants to
pin the mapped page we can mark the PTE table first, so fork() won't
waste time doing the work for sharing.

Having pinned pages is a corner case for most apps. No need to worry about
optimizing this corner case for now.

I see what you are trying to optimize, but I don't think this is needed in a
first version, and probably never is needed.


Any attempts to mark page tables in a certain way from GUP
(COW_PTE_OWNER_EXCLUSIVE) is problematic either way: GUP-fast
(get_user_pages_fast) can race with pretty much anything, even with
concurrent fork. I suspect your current code might be really racy in that
regard.

I see.
Now, I know why optimizing that corner case is not worth it.
Thank you for explaining that.

Falling back after already processing some PTEs requires some care, though. I guess it's not too hard to get it right -- it might be harder to get it "clean". But we can talk about that detail later.

--
Thanks,

David / dhildenb