Re: [PATCH v3 1/3] mm/khugepaged: Take the right locks for page table retraction

From: Jann Horn
Date: Mon Nov 28 2022 - 12:29:05 EST


On Mon, Nov 28, 2022 at 2:53 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> On 25.11.22 22:37, Jann Horn wrote:
> > Page table walks on address ranges mapped by VMAs can be done under the mmap
> > lock, the lock of an anon_vma attached to the VMA, or the lock of the VMA's
> > address_space. Only one of these needs to be held, and it does not need to
> > be held in exclusive mode.
> >
> > Under those circumstances, the rules for concurrent access to page table
> > entries are:
> >
> > - Terminal page table entries (entries that don't point to another page
> > table) can be arbitrarily changed under the page table lock, with the
> > exception that they always need to be consistent for
> > hardware page table walks and lockless_pages_from_mm().
> > In particular, they can be changed into non-terminal entries.
> > - Non-terminal page table entries (which point to another page table)
> > cannot be modified; readers are allowed to READ_ONCE() an entry, verify
> > that it is non-terminal, and then assume that its value will stay as-is.
> >
> > Retracting a page table involves modifying a non-terminal entry, so
> > page-table-level locks are insufficient to protect against concurrent
> > page table traversal; it requires taking all the higher-level locks under
> > which it is possible to start a page walk in the relevant range in
> > exclusive mode.
> >
> > The collapse_huge_page() path for anonymous THP already follows this rule,
> > but the shmem/file THP path was getting it wrong, making it possible for
> > concurrent rmap-based operations to cause corruption.
>
> This sounds sane and correct to me. No expert on file-THP, though.
>
> For anon-THP it's the mmap lock and the rmap locks. I assume the only
> difference for file-THP is that the rmap lock is actually the mapping
> lock. Looking at rmap_walk_file(), that seems to be the case.

Yeah. You can also have private file VMAs that are associated with
both a mapping and a set of anon_vmas, and in that case you would need
to lock the mmap, the mapping, and the anon_vma root; but the file THP
code in khugepaged instead just bails on file VMAs with an anon_vma.
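
As a rough userspace sketch of that rule (not kernel code; every name
below is hypothetical and made up for illustration), the set of locks
that would have to be held in exclusive mode before retracting a page
table could be modelled like this:

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model, NOT kernel code. The identifiers are
 * invented for this sketch and do not exist in the kernel.
 */
enum retract_lock {
	LOCK_MMAP     = 1 << 0, /* mmap lock of the mm */
	LOCK_MAPPING  = 1 << 1, /* lock of the VMA's address_space (file rmap) */
	LOCK_ANON_VMA = 1 << 2, /* lock of the anon_vma root (anon rmap) */
};

struct model_vma {
	bool has_mapping;  /* VMA is backed by a file/shmem mapping */
	bool has_anon_vma; /* VMA has an anon_vma attached */
};

/*
 * Every lock under which a page walk in the range could be started
 * has to be taken in exclusive mode before a page table is retracted.
 */
static unsigned int locks_for_retraction(const struct model_vma *vma)
{
	unsigned int locks = LOCK_MMAP;

	if (vma->has_mapping)
		locks |= LOCK_MAPPING;
	if (vma->has_anon_vma)
		locks |= LOCK_ANON_VMA;
	return locks;
}

/*
 * Models the shortcut described above: the file THP path bails on
 * file VMAs that also have an anon_vma, so it never needs all three.
 */
static bool file_thp_may_retract(const struct model_vma *vma)
{
	return vma->has_mapping && !vma->has_anon_vma;
}
```

In this model a private file VMA with an anon_vma would need all three
locks, which is exactly the case the file THP path avoids by bailing.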

> I wish at least PTE table removal could be done more easily ... I already
> experimented some time ago with some ideas (e.g., lock in PMD table
> memmap) but it's all far from trivial and space in the memmap is scarce.

Because you want it to be faster? Is that for the THP use case or something else?