Re: [PATCH v3 15/15] mm/mmap: Change vma iteration order in do_vmi_align_munmap()

From: Jann Horn
Date: Tue Aug 15 2023 - 10:21:08 EST


On Tue, Aug 15, 2023 at 9:29 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
>
> * Jann Horn <jannh@xxxxxxxxxx> [230814 17:22]:
> > On Mon, Aug 14, 2023 at 10:32 PM Liam R. Howlett
> > <Liam.Howlett@xxxxxxxxxx> wrote:
> > > * Jann Horn <jannh@xxxxxxxxxx> [230814 11:44]:
> > > > @akpm
> > > >
> > > > On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
> > > > > Since prev will be set later in the function, it is better to reverse
> > > > > the splitting direction of the start VMA (modify the new_below argument
> > > > > to __split_vma).
> > > >
> > > > It might be a good idea to reorder "mm: always lock new vma before
> > > > inserting into vma tree" before this patch.
> > > >
> > > > If you apply this patch without "mm: always lock new vma before
> > > > inserting into vma tree", I think move_vma(), when called with a start
> > > > address in the middle of a VMA, will behave like this:
> > > >
> > > > - vma_start_write() [lock the VMA to be moved]
> > > > - move_page_tables() [moves page table entries]
> > > > - do_vmi_munmap()
> > > > - do_vmi_align_munmap()
> > > > - __split_vma()
> > > > - creates a new VMA **covering the moved range** that is **not locked**
> > > > - stores the new VMA in the VMA tree **without locking it** [1]
> > > > - new VMA is locked and removed again [2]
> > > > [...]
> > > >
> > > > So after the page tables in the region have already been moved, I
> > > > believe there will be a brief window (between [1] and [2]) where page
> > > > faults in the region can happen again, which could probably cause new
> > > > page tables and PTEs to be created in the region again in that window.
> > > > (This can't happen in Linus' current tree because the new VMA created
> > > > by __split_vma() only covers the range that is not being moved.)
> > >
> > > Ah, so my reversal of which VMA the first split call keeps opens a
> > > window where the VMA being removed is not locked. Good catch.
>
> Looking at this again, I think it exists in Linus' tree and my change
> actually removes this window:
>
> - error = __split_vma(vmi, vma, start, 0);
> + error = __split_vma(vmi, vma, start, 1);
> if (error)
> goto start_split_failed;
>
> The last argument is "new_below", which means the new VMA will be at the
> lower address. I don't love the int argument or its name, and the
> documentation for the split function is lacking.
>
> So, once we split at "start", the new VMA has vm_end = "start" while the
> old VMA has vm_start = "start". I then lock the old VMA to be removed
> (again) and add it to the detached maple tree.
>
> Before my patch, we split the VMA and took the new, unlocked VMA for
> removal... until I locked the new VMA to be removed and added it to the
> detached maple tree. So there is a window where we write the new split
> VMA into the tree prior to locking it, but it is locked before removal.
>
> This change actually aligns the splitting with the other callers who use
> the split_vma() wrapper.

Oooh, you're right. Sorry, I misread that.

> > >
> > > >
> > > > Though I guess that's not going to lead to anything bad, since
> > > > do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
> > > > So maybe it's not that important.
> > >
> > > do_vmi_munmap() will clean up PTEs from the end of the previous VMA to
> > > the start of the next VMA.
> >
> > Alright, I guess no action is needed here then.
>
> I don't see a difference between this and the race that exists when, after
> a page fault ends, another task unmaps the area before the first task uses
> the faulted pages?

Yeah, you're right. I was thinking about it in terms of "this is a
weird situation and it would be dangerous if something relied on there
not being any non-empty PTEs in the range", but there's nothing that
relies on it, so that's fine.