Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table

From: David Hildenbrand
Date: Wed Sep 28 2022 - 10:03:36 EST


On 27.09.22 21:53, Chih-En Lin wrote:
On Tue, Sep 27, 2022 at 06:38:05PM +0000, Nadav Amit wrote:
I only skimmed the patches that you sent. The last couple of patches seem a
bit rough and dirty, so I am sorry to say that I skipped them (too many
“TODO” and “XXX” for my taste).

I am sure other will have better feedback than me. I understand there is a
tradeoff and that this mechanism is mostly for high performance
snapshotting/forking. It would be beneficial to see whether this mechanism
can somehow be combined with existing ones (mshare?).

Still thanks for your feedback. :)
I'm looking at the PTE refcount and mshare patches. And, maybe it can
combine with them in the future.

The code itself can be improved. I found the reasoning about synchronization
and TLB flushes and synchronizations to be lacking, and the code to seem
potentially incorrect. Better comments would help, even if the code is
correct.

There are additional general questions. For instance, when sharing a
page-table, do you properly update the refcount/mapcount of the mapped
pages? And are there any possible interactions with THP?

Since access to those mapped pages will cost a lot of time, and this
will make fork() even have more overhead. It will not update the
refcount/mapcount of the mapped pages.

Oh no.

So we'd have pages logically mapped into two processes (two page table structures), but the refcount/mapcount/PageAnonExclusive would not reflect that?

Honestly, I don't think it is upstream material in that hacky form. No, we don't need more COW CVEs or more COW over-complications that destabilize the whole system.

IMHO, a relaxed form that focuses on only the memory consumption reduction could *possibly* be accepted upstream if it's not too invasive or complex. During fork(), we'd do exactly what we used to do to PTEs (increment mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O, duplicate swap entries; all while holding the page table lock), however, sharing the prepared page table with the child process using COW after we prepared it.

Any (most once we want to *optimize* rmap handling) modification attempts require breaking COW -- copying the page table for the faulting process. But at that point, the PTEs are already write-protected and properly accounted (refcount/mapcount/PageAnonExclusive).

Doing it that way might not require any questionable GUP hacks and swapping, MMU notifiers etc. "might just work as expected" because the accounting remains unchanged" -- we simply de-duplicate the page table itself we'd have after fork and any modification attempts simply replace the mapped copy.

But devil is in the detail (page table lock, TLB flushing).

"will make fork() even have more overhead" is not a good excuse for such complexity/hacks -- sure, it will make your benchmark results look better in comparison ;)

--
Thanks,

David / dhildenb