Re: [PATCH next] mm/khugepaged: fix conflicting mods to collapse_file()

From: David Stevens
Date: Sun Apr 23 2023 - 22:37:16 EST


On Mon, Apr 24, 2023 at 11:17 AM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>
> On Mon, 24 Apr 2023, David Stevens wrote:
> > On Sun, Apr 23, 2023 at 1:47 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> > >
> > > Inserting Ivan Orlov's syzbot fix commit 2ce0bdfebc74
> > > ("mm: khugepaged: fix kernel BUG in hpage_collapse_scan_file()")
> > > ahead of Jiaqi Yan's and David Stevens's commits
> > > 12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory")
> > > cae106dd67b9 ("mm/khugepaged: refactor collapse_file control flow")
> > > ac492b9c70ca ("mm/khugepaged: skip shmem with userfaultfd")
> > > (all of which restructure collapse_file()) did not work out well.
> > >
> > > xfstests generic/086 on huge tmpfs (with accelerated khugepaged) freezes
> > > (if not on the first attempt, then the 2nd or 3rd) in find_lock_entries()
> > > while doing drop_caches: the file's xarray seems to have been corrupted,
> > > with find_get_entry() returning nonsense which makes no progress.
> > >
> > > Bisection led to ac492b9c70ca; and diff against earlier working linux-next
> > > suggested that it's probably down to an errant xas_store(), which does not
> > > belong with the later changes (and nor does the positioning of warnings).
> > > The later changes look as if they fix the syzbot issue independently.
> > >
> > > Remove most of what's left of 2ce0bdfebc74: just leave one WARN_ON_ONCE
> > > (xas_error) after the final xas_store() of the multi-index entry.
> > >
> > > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> > > ---
> > >
> > > mm/khugepaged.c | 23 +----------------------
> > > 1 file changed, 1 insertion(+), 22 deletions(-)
> > >
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1941,16 +1941,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > result = SCAN_FAIL;
> > > goto xa_locked;
> > > }
> > > - xas_store(&xas, hpage);
> > > - if (xas_error(&xas)) {
> > > - /* revert shmem_charge performed
> > > - * in the previous condition
> > > - */
> > > - mapping->nrpages--;
> > > - shmem_uncharge(mapping->host, 1);
> > > - result = SCAN_STORE_FAILED;
> >
> > With this being removed, SCAN_STORE_FAILED should also be removed from
> > the scan_result enum and trace event definitions.
>
> Only if we also remove your use of SCAN_STORE_FAILED in ac492b9c70ca:
> what would you want that to say instead?
>
> I don't care myself for any of those "SCAN" result codes, nor whether they
> are few or many: I'd rather have __LINE__ numbers for my own debugging.
>
> But if people want to remove SCAN_STORE_FAILED now, sure, send a patch;
> my intent was to unbreak the breakage.

Oh, I had forgotten that I used the code when I rebased v6 of my
patches. I have no strong opinions one way or the other. Sorry for the
noise.

-David