Re: [PATCH 2/9] mm: Add an apply_to_pfn_range interface

From: Thomas Hellstrom
Date: Wed Apr 17 2019 - 05:16:30 EST


On Tue, 2019-04-16 at 10:46 -0400, Jerome Glisse wrote:
> On Sat, Apr 13, 2019 at 08:34:02AM +0000, Thomas Hellstrom wrote:
> > Hi, Jérôme
> >
> > On Fri, 2019-04-12 at 17:07 -0400, Jerome Glisse wrote:
> > > On Fri, Apr 12, 2019 at 04:04:18PM +0000, Thomas Hellstrom wrote:
> > > > This is basically apply_to_page_range with added functionality:
> > > > Allocating missing parts of the page table becomes optional, which
> > > > means that the function can be guaranteed not to error if allocation
> > > > is disabled. Also, passing of the closure struct and callback function
> > > > becomes different and more in line with how things are done elsewhere.
> > > >
> > > > Finally, we keep apply_to_page_range as a wrapper around
> > > > apply_to_pfn_range.
> > > >
> > > > The reason for not using the page-walk code is that we want to perform
> > > > the page-walk on vmas pointing to an address space without requiring
> > > > the mmap_sem to be held, rather than on vmas belonging to a process
> > > > with the mmap_sem held.
> > > >
> > > > Notable changes since RFC:
> > > > Don't export apply_to_pfn_range.
> > > >
> > > > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > > > Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> > > > Cc: Will Deacon <will.deacon@xxxxxxx>
> > > > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > > Cc: Rik van Riel <riel@xxxxxxxxxxx>
> > > > Cc: Minchan Kim <minchan@xxxxxxxxxx>
> > > > Cc: Michal Hocko <mhocko@xxxxxxxx>
> > > > Cc: Huang Ying <ying.huang@xxxxxxxxx>
> > > > Cc: Souptick Joarder <jrdr.linux@xxxxxxxxx>
> > > > Cc: "Jérôme Glisse" <jglisse@xxxxxxxxxx>
> > > > Cc: linux-mm@xxxxxxxxx
> > > > Cc: linux-kernel@xxxxxxxxxxxxxxx
> > > > Signed-off-by: Thomas Hellstrom <thellstrom@xxxxxxxxxx>
> > > > ---
> > > > include/linux/mm.h | 10 ++++
> > > > mm/memory.c        | 130 ++++++++++++++++++++++++++++++++++----------
> > > > 2 files changed, 108 insertions(+), 32 deletions(-)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 80bb6408fe73..b7dd4ddd6efb 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -2632,6 +2632,16 @@ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> > > > extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
> > > > 			       unsigned long size, pte_fn_t fn, void *data);
> > > >
> > > > +struct pfn_range_apply;
> > > > +typedef int (*pter_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> > > > +			 struct pfn_range_apply *closure);
> > > > +struct pfn_range_apply {
> > > > +	struct mm_struct *mm;
> > > > +	pter_fn_t ptefn;
> > > > +	unsigned int alloc;
> > > > +};
> > > > +extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> > > > +			      unsigned long address, unsigned long size);
> > > >
> > > > #ifdef CONFIG_PAGE_POISONING
> > > > extern bool page_poisoning_enabled(void);
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index a95b4a3b1ae2..60d67158964f 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -1938,18 +1938,17 @@ int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long
> > > > }
> > > > EXPORT_SYMBOL(vm_iomap_memory);
> > > >
> > > > -static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> > > > -			      unsigned long addr, unsigned long end,
> > > > -			      pte_fn_t fn, void *data)
> > > > +static int apply_to_pte_range(struct pfn_range_apply *closure, pmd_t *pmd,
> > > > +			      unsigned long addr, unsigned long end)
> > > > {
> > > > 	pte_t *pte;
> > > > 	int err;
> > > > 	pgtable_t token;
> > > > 	spinlock_t *uninitialized_var(ptl);
> > > >
> > > > -	pte = (mm == &init_mm) ?
> > > > +	pte = (closure->mm == &init_mm) ?
> > > > 		pte_alloc_kernel(pmd, addr) :
> > > > -		pte_alloc_map_lock(mm, pmd, addr, &ptl);
> > > > +		pte_alloc_map_lock(closure->mm, pmd, addr, &ptl);
> > > > 	if (!pte)
> > > > 		return -ENOMEM;
> > > >
> > > > @@ -1960,86 +1959,107 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> > > > 	token = pmd_pgtable(*pmd);
> > > >
> > > > 	do {
> > > > -		err = fn(pte++, token, addr, data);
> > > > +		err = closure->ptefn(pte++, token, addr, closure);
> > > > 		if (err)
> > > > 			break;
> > > > 	} while (addr += PAGE_SIZE, addr != end);
> > > >
> > > > 	arch_leave_lazy_mmu_mode();
> > > >
> > > > -	if (mm != &init_mm)
> > > > +	if (closure->mm != &init_mm)
> > > > 		pte_unmap_unlock(pte-1, ptl);
> > > > 	return err;
> > > > }
> > > >
> > > > -static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
> > > > -			      unsigned long addr, unsigned long end,
> > > > -			      pte_fn_t fn, void *data)
> > > > +static int apply_to_pmd_range(struct pfn_range_apply *closure, pud_t *pud,
> > > > +			      unsigned long addr, unsigned long end)
> > > > {
> > > > 	pmd_t *pmd;
> > > > 	unsigned long next;
> > > > -	int err;
> > > > +	int err = 0;
> > > >
> > > > 	BUG_ON(pud_huge(*pud));
> > > >
> > > > -	pmd = pmd_alloc(mm, pud, addr);
> > > > +	pmd = pmd_alloc(closure->mm, pud, addr);
> > > > 	if (!pmd)
> > > > 		return -ENOMEM;
> > > > +
> > > > 	do {
> > > > 		next = pmd_addr_end(addr, end);
> > > > -		err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
> > > > +		if (!closure->alloc && pmd_none_or_clear_bad(pmd))
> > > > +			continue;
> > > > +		err = apply_to_pte_range(closure, pmd, addr, next);
> > > > 		if (err)
> > > > 			break;
> > > > 	} while (pmd++, addr = next, addr != end);
> > > > 	return err;
> > > > }
> > > >
> > > > -static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
> > > > -			      unsigned long addr, unsigned long end,
> > > > -			      pte_fn_t fn, void *data)
> > > > +static int apply_to_pud_range(struct pfn_range_apply *closure, p4d_t *p4d,
> > > > +			      unsigned long addr, unsigned long end)
> > > > {
> > > > 	pud_t *pud;
> > > > 	unsigned long next;
> > > > -	int err;
> > > > +	int err = 0;
> > > >
> > > > -	pud = pud_alloc(mm, p4d, addr);
> > > > +	pud = pud_alloc(closure->mm, p4d, addr);
> > > > 	if (!pud)
> > > > 		return -ENOMEM;
> > > > +
> > > > 	do {
> > > > 		next = pud_addr_end(addr, end);
> > > > -		err = apply_to_pmd_range(mm, pud, addr, next, fn, data);
> > > > +		if (!closure->alloc && pud_none_or_clear_bad(pud))
> > > > +			continue;
> > > > +		err = apply_to_pmd_range(closure, pud, addr, next);
> > > > 		if (err)
> > > > 			break;
> > > > 	} while (pud++, addr = next, addr != end);
> > > > 	return err;
> > > > }
> > > >
> > > > -static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
> > > > -			      unsigned long addr, unsigned long end,
> > > > -			      pte_fn_t fn, void *data)
> > > > +static int apply_to_p4d_range(struct pfn_range_apply *closure, pgd_t *pgd,
> > > > +			      unsigned long addr, unsigned long end)
> > > > {
> > > > 	p4d_t *p4d;
> > > > 	unsigned long next;
> > > > -	int err;
> > > > +	int err = 0;
> > > >
> > > > -	p4d = p4d_alloc(mm, pgd, addr);
> > > > +	p4d = p4d_alloc(closure->mm, pgd, addr);
> > > > 	if (!p4d)
> > > > 		return -ENOMEM;
> > > > +
> > > > 	do {
> > > > 		next = p4d_addr_end(addr, end);
> > > > -		err = apply_to_pud_range(mm, p4d, addr, next, fn, data);
> > > > +		if (!closure->alloc && p4d_none_or_clear_bad(p4d))
> > > > +			continue;
> > > > +		err = apply_to_pud_range(closure, p4d, addr, next);
> > > > 		if (err)
> > > > 			break;
> > > > 	} while (p4d++, addr = next, addr != end);
> > > > 	return err;
> > > > }
> > > >
> > > > -/*
> > > > - * Scan a region of virtual memory, filling in page tables as necessary
> > > > - * and calling a provided function on each leaf page table.
> > > > +/**
> > > > + * apply_to_pfn_range - Scan a region of virtual memory, calling a
> > > > + * provided function on each leaf page table entry
> > > > + * @closure: Details about how to scan and what function to apply
> > > > + * @addr: Start virtual address
> > > > + * @size: Size of the region
> > > > + *
> > > > + * If @closure->alloc is set to 1, the function will fill in the page
> > > > + * table as necessary. Otherwise it will skip non-present parts.
> > > > + * Note: The caller must ensure that the range does not contain huge
> > > > + * pages. The caller must also ensure that the proper mmu_notifier
> > > > + * functions are called, either in the pte leaf function or before and
> > > > + * after the call to apply_to_pfn_range.
> > >
> > > This is wrong; there should be a big FAT warning that this can only
> > > be used against an mmap of a device file. The page table walking
> > > above is broken for various things you might find in any other vma,
> > > like THP, device pte, hugetlbfs,
> >
> > I was figuring that since we didn't export the function anymore, the
> > warning and checks could be left to its users, assuming that any other
> > future usage of this function would require an audit by mm people
> > anyway. But I can of course add that warning to this function as well,
> > if you still want that?
>
> Yeah, more warnings are better; people might start using this. I know
> some people use unexported symbols and then report bugs while they
> were just doing something illegal.
>
> > > ...
> > >
> > > Also, the mmu notifier can not be called from the pfn callback, as
> > > that callback happens under the page table lock (the change_pte
> > > notifier callback is useless and not enough). So it _must_ happen
> > > around the call to apply_to_pfn_range
> >
> > In the comments I had in mind usage of, for example,
> > ptep_clear_flush_notify(). But you're the mmu_notifier expert here.
> > Are you saying that function by itself would not be sufficient?
> > In that case, should I just scratch the text mentioning the pte leaf
> > function?
>
> ptep_clear_flush_notify() is useless ... I have posted patches to
> either restore it or remove it. In any case, you must call the mmu
> notifier range functions, and they can not happen under lock. Your
> usage looked fine (in the next patch), but I would rather have a bit
> of a comment here to make sure people are also aware of that.
>
> While we can hope that people would cc mm when using mm functions, it
> is not always the case. So I would rather be cautious and warn in the
> comments as much as possible.
>

OK. Understood. All this actually makes me inclined to try a bit harder
with a slight modification to the pagewalk code instead. I don't really
want to encourage two parallel code paths doing essentially the same
thing; one good and one bad.

One thing that confuses me a bit with the pagewalk code is that callers
(for example softdirty) typically call
mmu_notifier_invalidate_range_start() around the pagewalk, but then, if
it ends up splitting a pmd, mmu_notifier_invalidate_range() is called
again, within the first range. The docs aren't really clear on whether
that's permitted or not. Is it?

Thanks,
Thomas

> Cheers,
> Jérôme