Re: [syzbot] [mm?] kernel BUG in vma_replace_policy

From: Hugh Dickins
Date: Mon Sep 18 2023 - 20:34:21 EST


On Mon, 18 Sep 2023, Yang Shi wrote:
> On Fri, Sep 15, 2023 at 8:57 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> > On Fri, 15 Sep 2023, Yang Shi wrote:
> > >
> > > Hi Suren and Hugh,
> > >
> > > Thanks for figuring this out. The mbind behavior is a little bit messy
> > > and hard to follow. I tried my best to recall all the changes.
> >
> > Messy and confusing yes; and for every particular behavior, I suspect
> > that by now there exists some release which has done it that way.
> >
> > >
> > > IIUC, mbind did break the vma iteration early in the first place, then
> > > commit 6f4576e3687b ("mempolicy: apply page table walker on
> > > queue_pages_range()") changed the behavior (didn't break vma iteration
> > > early for some cases anymore), but it messed up the return value and
> > > caused some test case failures, and also violated the manual. The return
> > > value issue was fixed by commit a7f40cfe3b7a ("mm: mempolicy: make
> > > mbind() return -EIO when MPOL_MF_STRICT is specified"), this commit
> > > also restored the oldest behavior (break loop early). But it also
> > > breaks the loop early when MPOL_MF_MOVE|MOVEALL is set; the kernel should
> > > actually continue the loop to try to migrate all existing pages per
> > > the manual.
> >
> > Oh, I missed that aspect in my description: yes, I think that's the
> > worst of it: MPOL_MF_STRICT alone could break out early because it had
> > nothing more to learn by going further, but it was simply a mistake for
> > the MOVEs to break out early (and it's arguable what MOVE|STRICT should do).
> >
> > I thought you and I were going to have a debate about this, but we
> > appear to be in agreement. And I'm not sure whether I agree with
> > myself about whether do_mbind() should apply the mbind_range()s
> > when STRICT queue_pages_range() found an unmovable - there are
> > consistency and regression arguments both ways.
>
> They will not be added to the migration list in the first place. Why
> waste time trying to migrate the unmovable?

I don't understand you there. I was not proposing to try to migrate
the unmovable.

My doubts were really all about how to make sense of mbind() sometimes
failing with EFAULT, in which case it has not applied the mbind_range()s,
versus sometimes failing with EIO, in which case it may or may not have
applied the mbind_range()s.

And I've come to the conclusion (partially driven by precedent) that it
makes best sense to imagine the collection of folios on pagelist as a
part of MOVE's migration stage, and just an implementation detail that
it happens to be done before the mbind_range()s. So when there's a
MOVE involved, STRICT's EIO says that the mbind_range()s were applied
(but migrations were incomplete); but when no MOVE involved, EIO says
that the mbind_range()s were not applied (because it's being STRICT).
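
To spell that out as a sketch (not the actual code: the helpers named
here are stand-ins for what do_mbind() does inline):

        nr_failed = queue_pages_range(mm, start, end, nmask, flags, &pagelist);
        if (nr_failed < 0)
                /* -EFAULT, or -EIO when STRICT alone found a misplaced
                 * page: either way, no mbind_range()s have been applied */
                return nr_failed;
        err = mbind_range(...);         /* ranges applied before migrating */
        if (!err && (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)))
                nr_failed += migrate_the_pagelist(&pagelist);   /* stand-in */
        if (!err && nr_failed && (flags & MPOL_MF_STRICT))
                err = -EIO;     /* ranges were applied, migration incomplete */
        return err;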

I don't think there's any disagreement between us on this: it was just
hard for me to reach an understanding of behavior which I could defend.

>
> >
> > (I've been repeatedly puzzled by your comment in queue_folios_pte_range():
> > 		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
> > 			/* MPOL_MF_STRICT must be specified if we get here */
> > 			if (!vma_migratable(vma)) {
> > Does that comment about MPOL_MF_STRICT actually belong inside the
> > !vma_migratable(vma) block? Sometimes I think so, but sometimes I
> > remember that the interaction of those flags, and the skipping arranged
> > by queue_pages_test_walk(), is subtler than I imagine.)
>
> It is because the below code snippet from queue_pages_test_walk():
>
> 		if (!vma_migratable(vma) &&
> 		    !(flags & MPOL_MF_STRICT))
> 			return 1;
>
> When queue_pages_test_walk() returns 1, queue_folios_pte_range() will
> be skipped. So if queue_folios_pte_range() sees an unmigratable vma, it
> means MPOL_MF_STRICT must be set.

Thanks, yes, I eventually came to see that, once I got back into the code
(I had been right to remember "subtler than I imagine" above). Though I
don't think there's any good reason for the queueing code to have to
depend on such subtleties.
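
To make explicit where that comment could live, so that the dependence
is spelt out rather than relied on (an illustrative rearrangement of
your draft below, not a separate patch):

        if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
                if (!vma_migratable(vma)) {
                        /*
                         * MPOL_MF_STRICT must be specified if we get here:
                         * queue_pages_test_walk() has already skipped any
                         * unmigratable vma when STRICT was not set.
                         */
                        qp->has_unmovable = true;
                        continue;
                }
                /* ... queue the folio for migration ... */
        }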

>
> >
> > > It sounds like a regression. I will take a look at it.

At one point I was thinking it a regression in all the MOVE cases;
but it's only in the STRICT MOVE case, so maybe not so significant.

> >
> > Thanks! Please do, I don't have the time for it.

I came back in private mail to say that I'd not managed a buildable
v5.2 version of my qp->nr_failed patch, so reluctantly put in the time
to think through it all again, and do a v6.6-rc1 version to add into my
mm/mempolicy series.

I have that now, I'll send you the preview privately in a moment; but
leave posting it publicly until I've finished the commit messages for
all the series.

> >
> > >
> > > So the logic should conceptually look like:
> > >
> > > 	if (MPOL_MF_MOVE|MOVEALL)
> > > 		continue;
> > > 	if (MPOL_MF_STRICT)
> > > 		break;
> > >
> > > So it is still possible that some VMAs are not locked if only
> > > MPOL_MF_STRICT is set.
> >
> > Conditionally, I'll agree; but it's too easy for me to agree in the
> > course of trying to get an email out, then on later reflection come
> > to disagree. STRICT|MOVE behavior is arguable.
>
> I thought the code should conceptually do:
>
> 	if (MPOL_MF_MOVE|MOVEALL)
> 		scan all vmas
> 		try to migrate the existing pages
> 		return success
> 	else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
> 		scan all vmas
> 		try to migrate the existing pages
> 		return -EIO if unmovable or migration failed
> 	else /* MPOL_MF_STRICT alone */
> 		break early if meets unmovable and don't call mbind_range() at all
	else /* none of those flags */
		check the ranges in test_walk, EFAULT without mbind_range() if discontig.

Yes: to quote my own patch:
static bool strictly_unmovable(unsigned long flags)
{
	/*
	 * STRICT without MOVE flags lets do_mbind() fail immediately with -EIO
	 * if any misplaced page is found.
	 */
	return (flags & (MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ==
		MPOL_MF_STRICT;
}
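
And, purely for illustration of how that helper gets used (a paraphrase,
not necessarily the exact shape in my patch): the page table walk can
then fail fast, before any mbind_range() is applied, in the STRICT-only
case:

        /* Paraphrase: qp->nr_failed counts misplaced or unmigrated pages */
        if (qp->nr_failed && strictly_unmovable(flags))
                return -EIO;    /* do_mbind() then applies no ranges */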

>
> So the vma scan will just be skipped when MPOL_MF_STRICT alone is
> specified and mbind_range() won't be called in this case. So Suren's
> fix may not be needed.

Yes, Suren's fix can be reverted when your patch or mine goes in;
but Suren's is important for fixing the VMA locking issue meanwhile.

>
> >
> > I think the best I can do is send you (privately) my approx-v5.2 patch
> > for this (which I never got time to put into even a Google-internal
> > kernel, though an earlier version was there). In part because I did
> > more research back then, and its commit message cites several even
> > older commits than you cite above, which might help to shed more light
> > on the history (or might just be wrong). And in part because it may
> > give you some more ideas of what needs doing: notably qp->nr_failed,
> > because "man 2 migrate_pages" says "On success migrate_pages() returns
> > the number of pages that could not be moved", but we seem to have
> > lost sight of that (from which one may conclude that it's not very
> > important, but I did find it useful when testing); but of course
> > the usual doubts about the right way to count a page when compound.
> >
> > I'll check how easily that patch applies to a known base such as
> > v5.2, maybe trim it to fit better, then send it off to you.
>
> I'm thinking about the below fix (build tested against the latest
> mm-unstable only):

Yes, that looks about right (more "|="ing than necessary, for something
that's only ever going to be set to 1, which I think would better be "true").

And it's much smaller (rightly so if it's aimed at v6.6) than my patch
which is aimed at v6.7: mine doing quite a bit of cleanup, along with
the qp->nr_failed instead of your qp->has_unmovable, in order that
migrate_pages(2) can return the promised number of pages that could
not be moved.
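
(On the counting: one plausible way to accumulate qp->nr_failed, counting
small pages so as to match what the manpage promises, is, illustratively:

        if (!vma_migratable(vma) ||
            migrate_folio_add(folio, qp->pagelist, flags))
                qp->nr_failed += folio_nr_pages(folio);

though, as I said, the right way to count a compound page is exactly
the usual doubt.)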

Hugh

>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 42b5567e3773..c9b768a042a8 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -426,6 +426,7 @@ struct queue_pages {
>          unsigned long start;
>          unsigned long end;
>          struct vm_area_struct *first;
> +        bool has_unmovable;
>  };
>
>  /*
> @@ -446,9 +447,8 @@ static inline bool queue_folio_required(struct folio *folio,
>  /*
>   * queue_folios_pmd() has three possible return values:
>   * 0 - folios are placed on the right node or queued successfully, or
> - *     special page is met, i.e. huge zero page.
> - * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
> - *     specified.
> + *     special page is met, i.e. huge zero page, or unmovable page is found
> + *     but continue walking (indicated by queue_pages.has_unmovable).
>   * -EIO - is migration entry or only MPOL_MF_STRICT was specified and an
>   *        existing folio was already on a node that does not follow the
>   *        policy.
> @@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
>          if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
>                  if (!vma_migratable(walk->vma) ||
>                      migrate_folio_add(folio, qp->pagelist, flags)) {
> -                        ret = 1;
> +                        qp->has_unmovable |= 1;
>                          goto unlock;
>                  }
>          } else
> @@ -495,9 +495,8 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
>   *
>   * queue_folios_pte_range() has three possible return values:
>   * 0 - folios are placed on the right node or queued successfully, or
> - *     special page is met, i.e. zero page.
> - * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
> - *     specified.
> + *     special page is met, i.e. zero page, or unmovable page is found
> + *     but continue walking (indicated by queue_pages.has_unmovable).
>   * -EIO - only MPOL_MF_STRICT was specified and an existing folio was already
>   *        on a node that does not follow the policy.
>   */
> @@ -538,10 +537,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>                  if (!queue_folio_required(folio, qp))
>                          continue;
>                  if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
> -                        /* MPOL_MF_STRICT must be specified if we get here */
> +                        /*
> +                         * MPOL_MF_STRICT must be specified if we get here.
> +                         * Continue walking vmas due to MPOL_MF_MOVE* flags.
> +                         */
>                          if (!vma_migratable(vma)) {
> -                                has_unmovable = true;
> -                                break;
> +                                qp->has_unmovable |= 1;
> +                                continue;
>                          }
>
>                          /*
> @@ -550,16 +552,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>                           * need migrate other LRU pages.
>                           */
>                          if (migrate_folio_add(folio, qp->pagelist, flags))
> -                                has_unmovable = true;
> +                                qp->has_unmovable |= 1;
>                  } else
>                          break;
>          }
>          pte_unmap_unlock(mapped_pte, ptl);
>          cond_resched();
>
> -        if (has_unmovable)
> -                return 1;
> -
>          return addr != end ? -EIO : 0;
>  }
>
> @@ -599,7 +598,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
>                   * Detecting misplaced folio but allow migrating folios which
>                   * have been queued.
>                   */
> -                ret = 1;
> +                qp->has_unmovable |= 1;
>                  goto unlock;
>          }
>
> @@ -620,7 +619,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
>                           * Failed to isolate folio but allow migrating pages
>                           * which have been queued.
>                           */
> -                        ret = 1;
> +                        qp->has_unmovable |= 1;
>          }
>  unlock:
>          spin_unlock(ptl);
> @@ -756,12 +755,15 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
>                  .start = start,
>                  .end = end,
>                  .first = NULL,
> +                .has_unmovable = false,
>          };
>          const struct mm_walk_ops *ops = lock_vma ?
>                          &queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;
>
>          err = walk_page_range(mm, start, end, ops, &qp);
>
> +        if (qp.has_unmovable)
> +                err = 1;
>          if (!qp.first)
>                  /* whole range in hole */
>                  err = -EFAULT;
> @@ -1358,7 +1360,7 @@ static long do_mbind(unsigned long start, unsigned long len,
>                          putback_movable_pages(&pagelist);
>                  }
>
> -                if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
> +                if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT))
>                          err = -EIO;
>          } else {
>  up_out: