Re: [RFC PATCH 0/3] support large folio for mlock

From: David Hildenbrand
Date: Mon Jul 10 2023 - 06:00:34 EST


On 10.07.23 11:43, Yin, Fengwei wrote:
Hi David,

On 7/10/2023 5:32 PM, David Hildenbrand wrote:
On 09.07.23 15:25, Yin, Fengwei wrote:


On 7/8/2023 12:02 PM, Matthew Wilcox wrote:
I would be tempted to allocate memory & copy to the new mlocked VMA.
The old folio will go on the deferred_list and be split later, or its
valid parts will be written to swap and then it can be freed.
If the large folio splitting failure is because of GUP pages, can we
do copy here?

Let's say the GUP page is the target of a DMA operation and the DMA
operation is still ongoing. If we allocate a new page and copy the GUP
page content to it, the data in the new page can be corrupted.

No, we may only replace anon pages that are flagged as maybe shared (!PageAnonExclusive). We must not replace pages that are exclusive (PageAnonExclusive) unless we first try marking them maybe shared. Clearing will fail if the page may be pinned.
Thanks a lot for clarification.

So my understanding is that if large folio splitting fails, it's not always
true that we can allocate new folios, copy original large folio content to
new folios, remove original large folio from VMA and map the new folios to
VMA (like it's only true if original large folio is marked as maybe shared).


While it might work in many cases, there are some corner cases where it won't work.

So to summarize

(1) THP are transparent and should not result in arbitrary syscall
failures.
(2) Splitting a THP might fail at random points in time either due to
GUP pins or due to speculative page references (including
speculative GUP pins).
(3) Replacing an exclusive anon page that may be pinned will result in
memory corruption.

So we can try to split any THP that crosses VMA borders on VMA modifications (split due to munmap, mremap, madvise, mprotect, mlock, ...), but it's not guaranteed to work due to (2). And we can try to replace such pages, but it's not guaranteed to be allowed due to (3).

And as it's all transparent, we cannot fail (1).

For the other cases that Willy and I discussed (split on VMA modifications after fork()), we can at least always replace the anon page.
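
To make the ordering of the fallbacks concrete, here is a rough user-space sketch of the decision described above. It is only a model: struct thp_state, enum thp_action, can_replace() and choose_action() are hypothetical names, not kernel APIs, and the three booleans stand in for state the kernel would have to derive itself. The point is that the function always returns some action and never an error, which is constraint (1).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
enum thp_action {
	THP_SPLIT_NOW,   /* the split can be done right away */
	THP_REPLACE,     /* allocate new pages and copy the content */
	THP_DEFER_SPLIT, /* put the THP on the deferred split queue */
};

struct thp_state {
	bool extra_refs;     /* GUP pin or speculative reference: split fails, see (2) */
	bool anon_exclusive; /* models PageAnonExclusive */
	bool maybe_pinned;   /* clearing the exclusive marker would fail, see (3) */
};

/* Replacing is allowed for maybe-shared pages, or for exclusive pages
 * that can first be marked maybe shared (i.e. that are not possibly pinned). */
bool can_replace(const struct thp_state *s)
{
	assert(s);
	if (!s->anon_exclusive)
		return true;
	return !s->maybe_pinned;
}

/* Never fails: the deferred split queue is the last resort, see (1). */
enum thp_action choose_action(const struct thp_state *s)
{
	assert(s);
	if (!s->extra_refs)
		return THP_SPLIT_NOW;
	if (can_replace(s))
		return THP_REPLACE;
	return THP_DEFER_SPLIT;
}
```

In this model the after-fork() case is the one with anon_exclusive == false, where replacing is always possible.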

<details>

What always works is putting the THP on the deferred split queue to see if we can split it later. The deferred split queue is a bit suboptimal right now, because it requires the (sub)page mapcounts to detect whether the folio is partially mapped vs. fully mapped. If we want to get rid of that, we have to come up with something reasonable.

I was wondering if we could have an optimized deferred split queue that only conditionally splits: do an rmap walk and detect if (a) each page of the folio is still mapped (b) the folio does not cross a VMA. If both are met, one could skip the deferred split. But that needs a bit of thought -- but we're already doing an rmap walk when splitting, so scanning which parts are actually mapped does not sound too weird.
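
Something like the following user-space sketch is what I have in mind for the check. It is purely illustrative: struct rmap_hit and can_skip_deferred_split() are made-up names, and I am assuming the rmap walk can report, for each VMA that maps the folio, the VMA extent relative to the folio plus a bitmask of the folio pages mapped through that VMA (limited to 64 pages here to keep the model small).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result of visiting one VMA during the rmap walk. */
struct rmap_hit {
	long vma_first;       /* first page of the VMA, relative to folio page 0 */
	long vma_end;         /* one past the last page of the VMA, same units */
	uint64_t mapped_mask; /* bit i set: folio page i is mapped via this VMA */
};

/* Returns true if the deferred split could be skipped:
 * (a) each page of the folio is mapped by at least one VMA, and
 * (b) no visited VMA starts or ends inside the folio. */
bool can_skip_deferred_split(unsigned int nr_pages,
			     const struct rmap_hit *hits, size_t nr_hits)
{
	uint64_t all, seen = 0;
	size_t i;

	assert(nr_pages >= 1 && nr_pages <= 64);
	all = nr_pages == 64 ? UINT64_MAX : (UINT64_C(1) << nr_pages) - 1;

	for (i = 0; i < nr_hits; i++) {
		/* (b): the folio crosses this VMA's border */
		if (hits[i].vma_first > 0 || hits[i].vma_end < (long)nr_pages)
			return false;
		seen |= hits[i].mapped_mask & all;
	}

	/* (a): every folio page was seen mapped somewhere */
	return seen == all;
}
```

With no hits at all the function returns false, so a completely unmapped folio is not treated as "skip".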

</details>

--
Cheers,

David / dhildenb