Re: [PATCH next] mm/khugepaged: fix conflicting mods to collapse_file()

From: Hugh Dickins
Date: Sun Apr 23 2023 - 22:17:29 EST


On Mon, 24 Apr 2023, David Stevens wrote:
> On Sun, Apr 23, 2023 at 1:47 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> >
> > Inserting Ivan Orlov's syzbot fix commit 2ce0bdfebc74
> > ("mm: khugepaged: fix kernel BUG in hpage_collapse_scan_file()")
> > ahead of Jiaqi Yan's and David Stevens's commits
> > 12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory")
> > cae106dd67b9 ("mm/khugepaged: refactor collapse_file control flow")
> > ac492b9c70ca ("mm/khugepaged: skip shmem with userfaultfd")
> > (all of which restructure collapse_file()) did not work out well.
> >
> > xfstests generic/086 on huge tmpfs (with accelerated khugepaged) freezes
> > (if not on the first attempt, then the 2nd or 3rd) in find_lock_entries()
> > while doing drop_caches: the file's xarray seems to have been corrupted,
> > with find_get_entry() returning nonsense which makes no progress.
> >
> > Bisection led to ac492b9c70ca; and diff against earlier working linux-next
> > suggested that it's probably down to an errant xas_store(), which does not
> > belong with the later changes (and nor does the positioning of warnings).
> > The later changes look as if they fix the syzbot issue independently.
> >
> > Remove most of what's left of 2ce0bdfebc74: just leave one WARN_ON_ONCE
> > (xas_error) after the final xas_store() of the multi-index entry.
> >
> > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> > ---
> >
> > mm/khugepaged.c | 23 +----------------------
> > 1 file changed, 1 insertion(+), 22 deletions(-)
> >
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1941,16 +1941,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > result = SCAN_FAIL;
> > goto xa_locked;
> > }
> > - xas_store(&xas, hpage);
> > - if (xas_error(&xas)) {
> > - /* revert shmem_charge performed
> > - * in the previous condition
> > - */
> > - mapping->nrpages--;
> > - shmem_uncharge(mapping->host, 1);
> > - result = SCAN_STORE_FAILED;
>
> With this being removed, SCAN_STORE_FAILED should also be removed from
> the scan_result enum and trace event definitions.

Only if we also remove your use of SCAN_STORE_FAILED in ac492b9c70ca:
what would you want that to say instead?

I don't care myself for any of those "SCAN" result codes, nor whether they
are few or many: I'd rather have __LINE__ numbers for my own debugging.

But if people want to remove SCAN_STORE_FAILED now, sure, send a patch;
my intent was to unbreak the breakage.

Hugh