IMHO, a relaxed form that focuses on only the memory consumption reduction
could *possibly* be accepted upstream if it's not too invasive or complex.
During fork(), we'd do exactly what we used to do to PTEs (increment
mapcount, refcount, try to clear PageAnonExclusive, map the page R/O,
duplicate swap entries; all while holding the page table lock), but then
share the prepared page table with the child process using COW.
Any modification attempt (or most of them, once we want to *optimize* rmap
handling) requires breaking COW -- copying the page table for the faulting
process. But at that point, the PTEs are already write-protected and
properly accounted (refcount/mapcount/PageAnonExclusive).
Doing it that way might not require any questionable GUP hacks, and swapping,
MMU notifiers etc. "might just work as expected" because the accounting
remains unchanged -- we simply de-duplicate the page table itself we'd have
after fork, and any modification attempt simply replaces the mapped copy.
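
To make the idea above concrete, here is a toy userspace model of it. This
is only a sketch, not kernel code: every name (toy_page, toy_pte_table,
toy_fork_share_table(), toy_break_table_cow(), the "sharers" counter) is
made up for illustration, there is no locking or TLB flushing, and clearing
the exclusive marker is assumed to always succeed, which the real "trying to
clear PageAnonExclusive" step does not guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PTRS_PER_TABLE 4

/* Hypothetical stand-ins for struct page, a PTE and a PTE table. */
struct toy_page {
	int refcount;
	int mapcount;
	bool anon_exclusive;
};

struct toy_pte {
	struct toy_page *page;	/* NULL means "not present" */
	bool writable;
};

struct toy_pte_table {
	struct toy_pte pte[TOY_PTRS_PER_TABLE];
	int sharers;		/* number of processes mapping this table */
};

struct toy_mm {
	struct toy_pte_table *table;
};

/*
 * fork(): do the usual per-PTE accounting as if the table had been
 * copied, then share the prepared table instead of duplicating it.
 */
static void toy_fork_share_table(struct toy_mm *parent, struct toy_mm *child)
{
	struct toy_pte_table *t = parent->table;

	for (int i = 0; i < TOY_PTRS_PER_TABLE; i++) {
		struct toy_pte *pte = &t->pte[i];

		if (!pte->page)
			continue;
		pte->page->refcount++;
		pte->page->mapcount++;
		pte->page->anon_exclusive = false; /* assumed to succeed */
		pte->writable = false;		   /* map R/O */
	}
	t->sharers++;
	child->table = t;
}

/*
 * Modification attempt: give the faulting process its own copy of the
 * table. The per-page counters are left alone because they already
 * describe one mapping per process.
 */
static void toy_break_table_cow(struct toy_mm *mm)
{
	struct toy_pte_table *old = mm->table;
	struct toy_pte_table *copy;

	if (old->sharers == 1)
		return;		/* last user, nothing to de-share */

	copy = malloc(sizeof(*copy));
	assert(copy);
	memcpy(copy, old, sizeof(*copy));
	copy->sharers = 1;
	old->sharers--;
	mm->table = copy;
}
```

Note that toy_break_table_cow() never touches refcount or mapcount: that is
the "accounting remains unchanged" property in this model.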
Agree.
However, regarding the GUP hacks: if we want to COW the page table, we
still need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
check whether the PTE table is available or not before we do the COW on
the table). Otherwise, it gets more complicated, since we might have to
handle situations where, while preparing the COW work, we figure out that
we need to duplicate the whole table and have to roll back (recover the
state and copy it to a new table). Hopefully, I'm not wrong here.
But the devil is in the details (page table lock, TLB flushing).
Sure, it might add overhead to the page fault path and needs to be handled
carefully. ;)
"will make fork() even have more overhead" is not a good excuse for such
complexity/hacks -- sure, it will make your benchmark results look better in
comparison ;)
;);)
I think that, even if we do the accounting with the COW page table, it
still gives a small improvement.