Re: [RFC PATCH 0/3] support large folio for mlock

From: Matthew Wilcox
Date: Fri Jul 07 2023 - 15:29:53 EST


On Fri, Jul 07, 2023 at 09:15:02PM +0200, David Hildenbrand wrote:
> > > Sure, any time we PTE-map a THP we might just say "let's put that on the
> > > deferred split queue" and cross fingers that we can eventually split it
> > > later. (I was recently thinking about that in the context of the mapcount
> > > ...)
> > >
> > > It's all a big mess ...
> >
> > Oh, I agree, there are always going to be circumstances where we realise
> > we've made a bad decision and can't (easily) undo it. Unless we have a
> > per-page pincount, and I Would Rather Not Do That.
>
> I agree ...
>
> > But we should _try_
> > to do that because it's the right model -- that's what I meant by "Tell
>
> Try to have per-page pincounts? :/ or do you mean, try to split on VMA
> split? I hope the latter (although I'm not sure about performance) :)

Sorry, try to split a folio on VMA split.

> > me why I'm wrong"; what scenarios do we have where a user temporarily
> > mlocks (or mprotects or ...) a range of memory, but wants that memory
> > to be aged in the LRU exactly the same way as the adjacent memory that
> > wasn't mprotected?
>
> Let me throw in a "fun one".
>
> Parent process has a 2 MiB range populated by a THP. fork() a child process.
> Child process mprotects half the VMA.
>
> Should we split the (COW-shared) THP? Or should we COW/unshare in the child
> process (ugh!) during the VMA split?
>
> It all makes my brain hurt.

OK, so this goes back to what I wrote earlier about attempting to choose
what size of folio to allocate on COW:

https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@xxxxxxxxxxxxxxxxxxxx/

: the parent had already established
: an appropriate size folio to use for this VMA before calling fork().
: Whether it is the parent or the child causing the COW, it should probably
: inherit that choice and we should default to the same size folio that
: was already found.

You've come up with a usefully different case here. I think we should
COW the folio at the point of the mprotect(). That will allow the parent
to become the sole owner of the folio once again and ensure that when
the parent modifies the folio, it _doesn't_ have to COW.

(This is also a rare case, surely)
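
For anyone who wants to poke at this case from userspace, here is a
minimal sketch of the scenario described above. It assumes 2 MiB
PMD-sized THPs and that THP is enabled (or set to madvise) on the test
system; the function name thp_fork_mprotect_demo() is made up for
illustration. It only sets up the situation, it does not check whether
the kernel split or COWed anything.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: 2 MiB PMD-sized THPs, as in the example being discussed. */
#define THP_SIZE (2UL << 20)

/*
 * Hypothetical helper: returns 0 if the child managed to mprotect() half
 * of the range, -1 on any failure.  Whether the range is really backed by
 * a THP depends on the system's THP settings; this sketch does not verify it.
 */
int thp_fork_mprotect_demo(void)
{
	/* Over-allocate so a 2 MiB aligned window can be carved out. */
	size_t len = 2 * THP_SIZE;
	char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;

	char *buf = (char *)(((uintptr_t)raw + THP_SIZE - 1) &
			     ~(uintptr_t)(THP_SIZE - 1));

	/* Only a hint; ignore failure on kernels built without THP. */
	(void)madvise(buf, THP_SIZE, MADV_HUGEPAGE);

	/* Parent populates the whole range before fork(). */
	memset(buf, 1, THP_SIZE);

	pid_t pid = fork();
	if (pid < 0) {
		munmap(raw, len);
		return -1;
	}
	if (pid == 0) {
		/*
		 * Child: changing protection on half the range splits the
		 * VMA while the memory is still COW-shared with the parent.
		 */
		int ret = mprotect(buf, THP_SIZE / 2, PROT_READ);
		_exit(ret == 0 ? 0 : 1);
	}

	int status = 0;
	pid_t waited = waitpid(pid, &status, 0);
	munmap(raw, len);
	if (waited < 0)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Watching AnonHugePages in /proc/<pid>/smaps for parent and child around
the mprotect() call is one way to see what the kernel actually did with
the folio.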

> >
> > GUP-pinning is different, and I don't think GUP-pinning should split
> > a folio. That's a temporary use (not FOLL_LONGTERM), eg, we're doing
> > tcp zero-copy or it's the source/target of O_DIRECT. That's not an
> > instruction that this memory is different from its neighbours.
> >
> > Maybe we end up deciding to split folios on GUP-pin. That would be
> > regrettable.
>
> That would probably never be accepted, because the ones that heavily rely on
> THP (databases, VMs), typically also end up using a lot of features that use
> (long-term) page pinning. Don't get me started on io_uring with fixed
> buffers.

I do think that something like a long-term pin should split a folio.
Otherwise we're condemning the rest of the folio to be pinned along
with it. Short term pins shouldn't split.
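
To summarise the positions taken in this mail, here is a toy decision
table written as C. It is not kernel code and none of these names exist
in the tree; the enums and proposed_action() are invented purely to
restate the proposal in one place.

```c
/* Hypothetical event types; these names are not kernel identifiers. */
enum folio_event {
	EV_VMA_SPLIT,            /* mlock()/mprotect() covers part of the folio */
	EV_VMA_SPLIT_COW_SHARED, /* same, but the folio is COW-shared after fork() */
	EV_PIN_SHORT_TERM,       /* e.g. O_DIRECT or TCP zero-copy */
	EV_PIN_LONG_TERM,        /* a FOLL_LONGTERM-style pin */
};

/* Hypothetical outcomes. */
enum folio_action {
	ACT_NONE,      /* leave the folio alone */
	ACT_TRY_SPLIT, /* attempt a split; it may fail and be retried later */
	ACT_COW,       /* copy at this point so the other owner keeps the folio */
};

/* Restates the policy argued for above; a sketch, not an implementation. */
enum folio_action proposed_action(enum folio_event ev)
{
	switch (ev) {
	case EV_VMA_SPLIT:
		return ACT_TRY_SPLIT;
	case EV_VMA_SPLIT_COW_SHARED:
		return ACT_COW;
	case EV_PIN_LONG_TERM:
		return ACT_TRY_SPLIT;
	case EV_PIN_SHORT_TERM:
	default:
		return ACT_NONE;
	}
}
```

The only distinction the table draws between the two pin cases is
duration: a short-term pin says nothing about the memory being different
from its neighbours, while a long-term pin would otherwise hold the rest
of the folio hostage.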