Re: RFC for new feature to move pages from one vma to another without split

From: Lorenzo Stoakes
Date: Wed Jun 07 2023 - 16:19:11 EST


On Thu, Apr 13, 2023 at 10:10:44AM +0200, David Hildenbrand wrote:
> For RMAP and friends (relying on linear_page_index), folio->index has to
> match the index within the VMA. If we would set pgoff to something else, we'd
> have less VMA merging opportunities. So your system might work, but you'd
> end up with many anon VMAs.

I think the reverse situation, i.e. splitting the VMA, is the more serious
one: without a correct index it would simply break rmap.

Consider:-

       [ VMA ]
          ^
          |
       [ avc ]
          ^
          |
    [ anon_vma ]
      ^   ^   ^
    /     |     \
page 1  page 2  page 3

If we unmap page 2, we cannot (or would rather not) update page 1 and page
3 to point to a new anon_vma and instead end up with:-

[ VMA 1 ]   [ VMA 3 ]
    ^           ^
    |           |
 [ avc ]     [ avc ]
    ^           ^
     \         /
    [ anon_vma ]
      ^      ^
     /        \
 page 1      page 3

Now you need some means of knowing which VMA each page belongs to. We have
to use folio->index to look up which anon_vma_chain (avc) in the anon_vma's
interval tree (which is keyed on folio->index) contains its VMA (actually
this could be multiple VMAs due to forking).
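
To make the lookup arithmetic concrete, here is a rough userspace sketch, not
kernel code. struct toy_vma, TOY_PAGE_SHIFT and the toy_* helpers are
hypothetical names invented for illustration; the real code walks the
anon_vma interval tree, and this model only shows the index comparison and
the address calculation that the walk relies on.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT 12UL	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the vm_area_struct fields that matter here. */
struct toy_vma {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte */
	unsigned long vm_pgoff;	/* page index corresponding to vm_start */
};

static inline unsigned long toy_vma_pages(const struct toy_vma *vma)
{
	return (vma->vm_end - vma->vm_start) >> TOY_PAGE_SHIFT;
}

/* Does this VMA cover the page whose folio->index is 'index'? */
static inline bool toy_vma_covers_index(const struct toy_vma *vma,
					unsigned long index)
{
	return index >= vma->vm_pgoff &&
	       index < vma->vm_pgoff + toy_vma_pages(vma);
}

/* Address at which the page with this index is mapped within the VMA. */
static inline unsigned long toy_vma_address(const struct toy_vma *vma,
					    unsigned long index)
{
	assert(toy_vma_covers_index(vma, index));
	return vma->vm_start + ((index - vma->vm_pgoff) << TOY_PAGE_SHIFT);
}
```

With VMA 1 and VMA 3 from the diagram sharing one anon_vma, page 3's index
falls only inside VMA 3's range, which is how the right VMA (and from it the
right address) is found without ever touching the page itself.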

mremap() seems to me to be a lot of the reason we don't just compute
vma->vm_start >> PAGE_SHIFT for folio->index on the fly: when a block of
memory is moved, we don't want to have to go and update all of the
underlying pages, so we just keep vm_pgoff the same as at the old position
even after the move. We keep this in vm_pgoff so we know which pgoffs to
give to new pages to put in their index fields.
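
As a hedged illustration of the above, the sketch below models a move in
userspace. struct toy_vma, toy_move_vma() and toy_linear_page_index() are
hypothetical names; the index arithmetic mirrors what linear_page_index()
does, but nothing else about mremap() is modelled.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12UL	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the vm_area_struct fields that matter here. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* Index a page faulted in at 'addr' would be given. */
static inline unsigned long toy_linear_page_index(const struct toy_vma *vma,
						  unsigned long addr)
{
	assert(addr >= vma->vm_start && addr < vma->vm_end);
	return ((addr - vma->vm_start) >> TOY_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Model of a move: the addresses change, vm_pgoff deliberately does not. */
static inline struct toy_vma toy_move_vma(struct toy_vma vma,
					  unsigned long new_start)
{
	unsigned long len = vma.vm_end - vma.vm_start;

	vma.vm_start = new_start;
	vma.vm_end = new_start + len;
	return vma;	/* vm_pgoff still describes the old position */
}
```

A page that had a given index at the old address keeps that index at the new
one, and a page faulted in after the move gets an index consistent with its
neighbours, so no existing page needs rewriting.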

As a result, we obviously wouldn't want to merge an mremap'd VMA that has
this special handling with one that doesn't, as the pages could then no
longer be rmap'd back to the correct VMAs. Requiring vm_pgoff to increase
linearly across the merged range achieves this.

Doing it this way keeps the code for the VMA manipulation logic the same
for file-backed and anon mappings so is (kind of) neat in that respect.

Oh, as a point of interest, there is _yet another_ thing that can go in
vm_pgoff: for remapped kernel mappings, remap_pfn_range_notrack() puts the
PFN in there :))

(as you can imagine I've torn out my rapidly diminishing hair writing about
this stuff in the book)

>
>
> Imagine the following:
>
> [ anon0 ][ fd ][ anon1 ]
>
> Unmap the fd:
>
> [ anon0 ][ hole ][ anon1 ]
>
> Mmap anon:
>
> [ anon0 ][ anon2 ][ anon1 ]
>
>
> We can now merge all 3 VMAs into one, even if the first and latter already
> map pages.
>
>
> A simpler and more common example is probably:
>
> [ anon0 ]
>
> Mmap anon1 before the existing one
>
> [ anon1 ][ anon0 ]
>
> Which we can merge into a single one.
>
>
>
> Mapping after an existing one could work, but one would have to carefully
> set pgoff based on the size of the previous anon VMA ... which is more
> complicated
>
> So instead, we consider the whole address space as a virtual, anon file,
> starting at offset 0. The pgoff of a VMA is then simply the offset in that
> virtual file (easily computed from the start of the VMA), and VMA merging is
> just the same as for an ordinary file.

This is a very good way of explaining it (though mremap complicates things
somewhat).
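
A small userspace sketch of the "virtual anon file" view, under stated
assumptions: struct toy_vma, toy_new_anon_vma() and toy_can_merge() are
hypothetical names, and the check covers only adjacency and pgoff
contiguity. The real merge logic also compares flags, anon_vma
compatibility and more, none of which is modelled here.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT 12UL	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the vm_area_struct fields that matter here. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* A fresh anon mapping: pgoff is just its offset in the "virtual file". */
static inline struct toy_vma toy_new_anon_vma(unsigned long start,
					      unsigned long end)
{
	struct toy_vma vma = { start, end, start >> TOY_PAGE_SHIFT };

	assert(start < end);
	return vma;
}

/* Adjacent in the address space and contiguous in pgoff? */
static inline bool toy_can_merge(const struct toy_vma *prev,
				 const struct toy_vma *next)
{
	unsigned long prev_pages =
		(prev->vm_end - prev->vm_start) >> TOY_PAGE_SHIFT;

	return prev->vm_end == next->vm_start &&
	       next->vm_pgoff == prev->vm_pgoff + prev_pages;
}
```

In this model anon0, anon2 and anon1 from the example above pass the check
pairwise, whichever order they were mapped in. A VMA that was moved next to
them while keeping the vm_pgoff of its old position fails it, which is the
mremap complication.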

>
> --
> Thanks,
>
> David / dhildenb
>