Re: [PATCH] mm/khugepaged: skip shmem with armed userfaultfd

From: Peter Xu
Date: Wed Feb 01 2023 - 15:54:44 EST


On Wed, Feb 01, 2023 at 09:36:37AM -0800, Yang Shi wrote:
> On Tue, Jan 31, 2023 at 7:42 PM David Stevens <stevensd@xxxxxxxxxxxx> wrote:
> >
> > From: David Stevens <stevensd@xxxxxxxxxxxx>
> >
> > Collapsing memory in a vma that has an armed userfaultfd results in
> > zero-filling any missing pages, which breaks user-space paging for those
> > filled pages. Avoid khugepage bypassing userfaultfd by not collapsing
> > pages in shmem reached via scanning a vma with an armed userfaultfd if
> > doing so would zero-fill any pages.
> >
> > Signed-off-by: David Stevens <stevensd@xxxxxxxxxxxx>
> > ---
> > mm/khugepaged.c | 35 ++++++++++++++++++++++++-----------
> > 1 file changed, 24 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 79be13133322..48e944fb8972 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1736,8 +1736,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > * + restore gaps in the page cache;
> > * + unlock and free huge page;
> > */
> > -static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > - struct file *file, pgoff_t start,
> > +static int collapse_file(struct mm_struct *mm, struct vm_area_struct *vma,
> > + unsigned long addr, struct file *file, pgoff_t start,
> > struct collapse_control *cc)
> > {
> > struct address_space *mapping = file->f_mapping;
> > @@ -1784,6 +1784,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > * be able to map it or use it in another way until we unlock it.
> > */
> >
> > + if (is_shmem)
> > + mmap_read_lock(mm);
>
> If you release mmap_lock before then reacquire it here, the vma is not
> trusted anymore. It is not safe to use the vma anymore.
>
> Since you already read uffd_was_armed before releasing mmap_lock, so
> you could pass it directly to collapse_file w/o dereferencing vma
> again. The problem may be false positive (not userfaultfd armed
> anymore), but it should be fine. Khugepaged could collapse this area
> in the next round.

Unfortunately that may not be enough.. because it's also possible that it
reads uffd_armed==false, released mmap_sem, passed it over to the scanner,
but then when scanning the file uffd got armed in parallel.

There's another problem where the current vma may not have uffd armed,
khugepaged may think it has nothing to do with uffd and moved on with
collapsing, but actually it's armed in another vma of either the current mm
or just another mm's.

It seems non-trivial too to safely check this across all the vmas, let's
say, by a reverse walk - the only safe way is to walk all the vmas and take
the write lock for every mm, but that's not only too heavy but also merely
impossible to always make it right because of deadlock issues and on the
order of mmap write lock to take..

So far what I can still think of is, if we can extend shmem_inode_info and
have a counter showing how many uffd has been armed. It can be a generic
counter too (e.g. shmem_inode_info.collapse_guard_counter) just to avoid
the page cache being collapsed under the hood, but I am also not aware of
whether it can be reused by other things besides uffd.

Then when we do the real collapsing, say, when:

xas_set_order(&xas, start, HPAGE_PMD_ORDER);
xas_store(&xas, hpage);
xas_unlock_irq(&xas);

We may need to make sure that counter keeps static (probably by holding
some locks during the process) and we only do that last phase collapse if
counter==0.

Similar checks in this patch can still be done, but that'll only service as
a role of failing faster before the ultimate check on the uffd_armed
counter. Otherwise I just don't quickly see how to avoid race conditions.

It'll be great if someone can come up with something better than above..
Copy Hugh too.

Thanks,

--
Peter Xu