Re: [RFC PATCH 0/3] support large folio for mlock

From: David Hildenbrand
Date: Fri Jul 07 2023 - 15:15:52 EST


On 07.07.23 21:06, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 08:54:33PM +0200, David Hildenbrand wrote:
>> On 07.07.23 19:26, Matthew Wilcox wrote:
>>> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
>>>> This series distinguishes two types of large folio for mlock:
>>>> - The large folio lies entirely within a VM_LOCKED VMA range
>>>> - The large folio crosses a VM_LOCKED VMA boundary
>>>
>>> This is somewhere that I think our fixation on MUST USE PMD ENTRIES
>>> has led us astray. Today when the arguments to mlock() cross a folio
>>> boundary, we split the PMD entry but leave the folio intact. That means
>>> that we continue to manage the folio as a single entry on the LRU list.
>>> But userspace may have no idea that we're doing this. It may have made
>>> several calls to mmap() 256kB at once, they've all been coalesced into
>>> a single VMA and khugepaged has come along behind its back and created
>>> a 2MB THP. Now userspace calls mlock() and instead of treating that as
>>> a hint that oops, maybe we shouldn't've done that, we do our utmost to
>>> preserve the 2MB folio.
>>>
>>> I think this whole approach needs rethinking. IMO, anonymous folios
>>> should not cross VMA boundaries. Tell me why I'm wrong.
>>
>> I think we touched upon that a couple of times already, and the main issue
>> is that while it sounds nice in theory, it's impossible in practice.
>>
>> THP are supposed to be transparent, that is, we should not let arbitrary
>> operations fail.
>>
>> But nothing stops user space from
>>
>> (a) mmap'ing a 2 MiB region
>> (b) GUP-pinning the whole range
>> (c) GUP-pinning the first half
>> (d) unpinning the whole range from (a)
>> (e) munmap'ing the second half
>>
>> And that's just one out of many examples I can think of, not even
>> considering temporary/speculative references that can prevent a split at
>> random points in time -- especially when splitting a VMA.
>>
>> Sure, any time we PTE-map a THP we might just say "let's put that on the
>> deferred split queue" and cross fingers that we can eventually split it
>> later. (I was recently thinking about that in the context of the mapcount
>> ...)
>>
>> It's all a big mess ...
>
> Oh, I agree, there are always going to be circumstances where we realise
> we've made a bad decision and can't (easily) undo it. Unless we have a
> per-page pincount, and I Would Rather Not Do That.

I agree ...

> But we should _try_
> to do that because it's the right model -- that's what I meant by "Tell

Try to have per-page pincounts? :/ or do you mean, try to split on VMA split? I hope the latter (although I'm not sure about performance) :)
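
To make that concrete, the case I have in mind is roughly this untested
userspace sketch; MADV_HUGEPAGE and the memset are only there to make a
PMD-sized folio likely, nothing guarantees one:

#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL << 20;         /* one PMD-sized area */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        madvise(p, len, MADV_HUGEPAGE);
        memset(p, 0, len);              /* fault it in, ideally as one 2 MiB folio */

        /*
         * mlock() only the first half: the VMA is split at p + 1 MiB and the
         * single large folio now crosses a VM_LOCKED VMA boundary.
         */
        mlock(p, len / 2);
        return 0;
}

With split-on-VMA-split, that mlock() would also split the 2 MiB folio at the
1 MiB mark.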

> me why I'm wrong"; what scenarios do we have where a user temporarily
> mlocks (or mprotects or ...) a range of memory, but wants that memory
> to be aged in the LRU exactly the same way as the adjacent memory that
> wasn't mprotected?

Let me throw in a "fun one".

Parent process has a 2 MiB range populated by a THP. fork() a child process. Child process mprotects half the VMA.

Should we split the (COW-shared) THP? Or should we COW/unshare in the child process (ugh!) during the VMA split?
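
In userspace terms, roughly this (untested sketch, again relying on
MADV_HUGEPAGE to make the 2 MiB folio likely):

#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL << 20;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        madvise(p, len, MADV_HUGEPAGE);
        memset(p, 1, len);              /* parent populates the (hopefully) 2 MiB folio */

        if (fork() == 0) {
                /*
                 * Child: the mprotect() splits the child's VMA while the
                 * THP is still COW-shared with the parent.
                 */
                mprotect(p, len / 2, PROT_READ);
                pause();
        }
        pause();                        /* parent keeps its mapping of the shared folio */
        return 0;
}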

It all makes my brain hurt.


> GUP-pinning is different, and I don't think GUP-pinning should split
> a folio. That's a temporary use (not FOLL_LONGTERM), eg, we're doing
> tcp zero-copy or it's the source/target of O_DIRECT. That's not an
> instruction that this memory is different from its neighbours.
>
> Maybe we end up deciding to split folios on GUP-pin. That would be
> regrettable.

That would probably never be accepted, because the ones that heavily rely on THP (databases, VMs), typically also end up using a lot of features that use (long-term) page pinning. Don't get me started on io_uring with fixed buffers.
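
For example, with liburing a fixed-buffer registration is a couple of calls and
takes a FOLL_LONGTERM pin on whatever folios happen to back the buffer
(untested sketch):

#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <liburing.h>

int main(void)
{
        struct io_uring ring;
        size_t len = 2UL << 20;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct iovec iov = { .iov_base = buf, .iov_len = len };

        madvise(buf, len, MADV_HUGEPAGE);
        memset(buf, 0, len);

        io_uring_queue_init(8, &ring, 0);

        /* long-term pin on the whole (ideally THP-backed) buffer ... */
        io_uring_register_buffers(&ring, &iov, 1);

        /* ... which stays in place until we unregister or tear down the ring */
        io_uring_unregister_buffers(&ring);
        io_uring_queue_exit(&ring);
        return 0;
}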

--
Cheers,

David / dhildenb