Re: [PATCH v2 1/4] mm/mremap: Optimize the start addresses in move_page_tables()

From: Joel Fernandes
Date: Fri May 19 2023 - 23:18:56 EST


Hi Linus,

On Fri, May 19, 2023 at 10:34 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, May 19, 2023 at 3:52 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > >
> > > I *suspect* that the test is literally just for the stack movement
> > > case by execve, where it catches the case where we're doing the
> > > movement entirely within the one vma we set up.
> >
> > Yes that's right, the test is only for the stack movement case. For
> > the regular mremap case, I don't think there is a way for it to
> > trigger.
>
> So I feel the test is simply redundant.
>
> For the regular mremap case, it never triggers.

Unfortunately, I just found that mremap-ing a range purely within a
VMA can actually cause the old and new VMA passed to
move_page_tables() to be the same.

I added a printk to the beginning of move_page_tables that prints all the args:
        printk("move_page_tables(vma=(%lx,%lx), old_addr=%lx, new_vma=(%lx,%lx), new_addr=%lx, len=%lx)\n",
               vma->vm_start, vma->vm_end, old_addr,
               new_vma->vm_start, new_vma->vm_end, new_addr, len);

Then I wrote a simple test to move 1MB purely within a 10MB range and
I found on running the test that the old and new vma passed to
move_page_tables() are exactly the same.

[ 19.697596] move_page_tables(vma=(7f1f985f7000,7f1f98ff7000), old_addr=7f1f987f7000, new_vma=(7f1f985f7000,7f1f98ff7000), new_addr=7f1f98af7000, len=100000)
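
For reference, a minimal userspace sketch of this kind of test. It is not the
actual test I ran and not code from mremap_test.c; the helper name
move_within_mapping() and the fill patterns are made up for illustration. The
offsets (2MB source, 5MB destination inside a 10MB mapping) are chosen to
match the addresses in the log above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define MB (1024UL * 1024UL)

/*
 * Hypothetical helper (not from mremap_test.c): map `total` bytes, then
 * mremap() `len` bytes from src_off to dst_off inside that same mapping.
 * Returns 0 if the data arrived and the bytes next to the destination
 * are untouched, -1 otherwise.
 */
static int move_within_mapping(size_t total, size_t src_off,
			       size_t dst_off, size_t len)
{
	unsigned char *base, *dst;
	int ret = -1;

	/*
	 * The source must end strictly before the destination (the source
	 * range becomes a hole after the move), and one byte must remain
	 * after the destination so both neighbours can be checked.
	 */
	assert(len > 0 && src_off + len < dst_off && dst_off + len < total);

	base = mmap(NULL, total, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	memset(base, 0xAA, total);		/* background pattern */
	memset(base + src_off, 0x55, len);	/* pattern that gets moved */

	/* MREMAP_FIXED replaces whatever is mapped at the destination. */
	dst = mremap(base + src_off, len, len,
		     MREMAP_MAYMOVE | MREMAP_FIXED, base + dst_off);
	if (dst == MAP_FAILED)
		goto out;

	/* Do not touch base + src_off here: it is unmapped after the move. */
	if (dst == base + dst_off &&
	    dst[0] == 0x55 && dst[len - 1] == 0x55 &&
	    base[dst_off - 1] == 0xAA && base[dst_off + len] == 0xAA)
		ret = 0;
out:
	munmap(base, total);
	return ret;
}
```

Calling move_within_mapping(10 * MB, 2 * MB, 5 * MB, 1 * MB) reproduces the
layout above. The neighbour checks are the part that matters for this patch,
since they would catch a move that touches more than the requested range.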

That is a bit counterintuitive, as I really thought we'd be splitting
the VMAs with such a move. Any idea what I am missing?

Also, such a use case will break with my patch, as we may accidentally
overwrite parts of a range that were not part of the mremap request.
Maybe I should just turn off the optimization if vma == new_vma.
However, that would also turn it off for the stack move, so another
option is to special-case stack moves in move_page_tables().
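
To make the hazard concrete, here is a rough sketch of the arithmetic using
the addresses from the log above. It assumes a 2MB PMD (as on x86-64 with 4K
pages) and assumes the optimization aligns an address down to a PMD boundary
whenever the aligned address is still inside the VMA. The TOY_* macros and
the function name are made up for illustration and are not from the patch.

```c
#include <assert.h>

/* Assumption: 2MB PMD size, as on x86-64 with 4K pages. */
#define TOY_PMD_SIZE	(1UL << 21)
#define TOY_PMD_MASK	(~(TOY_PMD_SIZE - 1))

/*
 * Hypothetical helper: how many bytes in front of `addr` would be pulled
 * into the move if `addr` were aligned down to a PMD boundary and that
 * boundary still lies inside the VMA starting at `vm_start`.
 * Returns 0 when the aligned address falls before the VMA.
 */
static unsigned long extra_bytes_if_aligned(unsigned long vm_start,
					    unsigned long addr)
{
	unsigned long masked = addr & TOY_PMD_MASK;

	assert(addr >= vm_start);

	if (masked < vm_start)
		return 0;	/* would leave the VMA; not the case shown here */

	return addr - masked;
}
```

With vm_start = 0x7f1f985f7000 and old_addr = 0x7f1f987f7000, the aligned
address is 0x7f1f98600000, which is still inside the VMA, so 0x1f7000 bytes
that were never part of the request sit between the aligned address and
old_addr.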

So this means I have to go back to the drawing board a bit on this
patch, and also add more tests in mremap_test.c to test such
within-VMA moving. I believe there are no such existing tests... More
work to do for me. :-)

> And for the stack movement case by execve, I don't think it matters if
> you just were to change the logic of the subsequent checks a bit.
>
> In particular, you do this:
>
>         /* If the masked address is within vma, there is no prev mapping of concern. */
>         if (vma->vm_start <= addr_masked)
>                 return false;
>
>         /*
>          * Attempt to find vma before prev that contains the address.
>          * On any issue, assume the address is within a previous mapping.
>          * @mmap write lock is held here, so the lookup is safe.
>          */
>         cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
>         if (!cur || cur != vma || !prev)
>                 return true;
>         /* The masked address fell within a previous mapping. */
>         if (prev->vm_end > addr_masked)
>                 return true;
>
>         return false;
>
> And I think that
>
>         if (!cur || cur != vma || !prev)
>                 return true;
>
> is actively wrong, because if there is no 'prev', then you should return false.

During my tests, I observed that there was always an existing,
unrelated mapping located before the new region allocated by mmap.
Based on that, I concluded that a missing previous mapping (i.e., prev
being NULL) indicates a potential issue with find_vma_prev(). So I made
the function return true whenever prev is NULL, treating the masked
address as unsuitable for the optimization.

That's obviously confusing so I'll try to rewrite this part of the
patch a bit better with appropriate comments.
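
As a rough userspace model of where this could end up (a sketch only, not the
next version of the patch): the early-out is kept purely to skip the lookup,
and a missing prev means there is nothing to overlap with. struct toy_vma,
its prev field (standing in for the result of find_vma_prev()), and the
function name are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct vm_area_struct; only the fields used here. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	struct toy_vma *prev;	/* hypothetical: models find_vma_prev()'s prev */
};

/*
 * Returns true when aligning down to addr_masked would reach into the
 * previous mapping, i.e. when the masking must be avoided.
 */
static bool masked_addr_hits_prev(const struct toy_vma *vma,
				  unsigned long addr_masked)
{
	assert(vma->vm_start < vma->vm_end);

	/* Cheap early-out: skips the prev lookup when masked addr is in vma. */
	if (vma->vm_start <= addr_masked)
		return false;

	/* No previous mapping at all: nothing the masked range could hit. */
	if (!vma->prev)
		return false;

	/* Only a prev that extends past the masked address is a problem. */
	return vma->prev->vm_end > addr_masked;
}
```

The only behavioural difference from the check quoted above is the NULL prev
case, which now returns false instead of true.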

> So I *think* all of the above could just be replaced with this instead:
>
>         find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
>         return prev && prev->vm_end > addr_masked;
>
> because only if we have a 'prev', and the prev is into that masked
> address, do we need to avoid doing the masking.
>
> With that simplified test, do you even care about that whole "the
> masked address was already in the vma"? Not that I can see.
>
> And we don't even care about the return value of 'find_vma_prev()',
> because it had better be 'vma'. We're giving it 'vma->vm_start' as an
> address, for chrissake!
>
> So if you *really* wanted to, you could do something like
>
>         cur = find_vma_prev(..);
>         if (WARN_ON_ONCE(cur != vma))
>                 return true;
>
> but even that WARN_ON_ONCE() seems pretty bogus. If it triggers, we
> have some serious corruption going on.
>
> So I still find that whole "vma->vm_start <= addr_masked" test a bit
> confusing, since it seems entirely redundant.
>
> Is it just because you wanted to avoid calling "find_vma_prev()" at
> all? Maybe just say that in the comment.

Yes exactly, I did not want to run find_vma_prev() unnecessarily. I
will add such clarifications in the comments.

Thanks for all the comments so far, I will continue to work on this.

- Joel