Re: [PATCH] mm/khugepaged: fix collapse_pte_mapped_thp() to allow anon_vma

From: David Hildenbrand
Date: Thu Jan 05 2023 - 05:45:50 EST


On 05.01.23 01:03, Yang Shi wrote:
On Wed, Jan 4, 2023 at 1:20 AM David Hildenbrand <david@xxxxxxxxxx> wrote:

Or am I wrong?

Is anon_vma lock required? Almost not: if any page other than expected
subpage of the non-anon huge page is found in the page table, collapse is
aborted without making any change. However, it is possible that an anon
page was CoWed from this extent in another mm or vma, in which case a
concurrent lookup might look here: so keep it away while clearing pmd
(but perhaps we shall go back to using pmd_lock() there in future).

Note that collapse_pte_mapped_thp() is exceptional in freeing a page table
without having cleared its ptes: I'm uneasy about that, and had thought
pte_clear()ing appropriate; but exclusive i_mmap lock does fix the problem,
and we would have to move the mmu_notification if clearing those ptes.

Fixes: 8d3c106e19e8 ("mm/khugepaged: take the right locks for page table
retraction")
Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Jann Horn <jannh@xxxxxxxxxx>
Cc: Yang Shi <shy828301@xxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: Zach O'Keefe <zokeefe@xxxxxxxxxx>
Cc: Song Liu <songliubraving@xxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx> [5.4+]
---
What this fixes is not a dangerous instability! But I suggest Cc stable
because uprobes "healing" has regressed in that way, so this should follow
8d3c106e19e8 into those stable releases where it was backported (and may
want adjustment there - I'll supply backports as needed).

If it's really something that doesn't matter in practice (e.g., -1%
performance while debugging :) ), I guess no CC is needed. If there are real
production workloads that suffer, I guess ccing stable is fine.

It's about recovering performance *after* debugging. It is not something
that is of any value to me personally, nor (so far as I know) to anyone
whom I work with. But it is something which Song Liu went to the trouble
to make possible in his "THP aware uprobe" series three years ago, and it
is something which Jann unintentionally regressed in his recent commit:
so I thought it proper to reinstate where regressed.

Right, although I wonder if that original series fixed a real
performance issue or was more a "this makes sense, let's just optimize
this corner case by some serious complexity". I hope it's not the latter :)


(What I do have more of an investment in, is for MADV_COLLAPSE to be able
to collapse some extents in a large vma where some other extent got CoWed,
so giving the whole vma an anon_vma. But that's not an issue for -stable,
and I cannot tell you offhand whether undoing this anon_vma exclusion is
enough to enable that or not - I suspect not, I suspect a result code or
switch statement needs to be adjusted too.)

Yeah, having a single COWed page in a large MAP_PRIVATE file/shmem
mapping would disable collapse, so it's the right thing to do.

Thinking about it some more, and the effective code change, stable
doesn't sound wrong.



Side note: set_huge_pmd() wins the award of "ugliest mm function of early
2023". I was briefly concerned how do_set_pmd() decides whether the PMD can be
writable or not. Turns out it's communicated via vm_fault->flags. Just
horrible.

I firmly disagree - it's from 2022! and much too small to be ugliest;
but I haven't thought about the aspect that is bothering you there.

The ugliest I stumbled over in early 2023 -- until January 2nd :D


What's bothered me most about it, is the way its name, and the naming of
the do_set_pmd() it interfaces with, give no hint that they are entirely
about file (or shmem) vmas, and would not work right on anon vmas
(I forget whether it's just a matter of which stats updated, or more).

Yes. I dug very deep into in-place collapse yesterday because I was
briefly concerned about anon THP, and it took me longer to understand
that whole machinery than it should (and that anon THP never ever
collapse in-place).

Some of that khugepaged stuff needs some *serious* cleanups and
refactoring. do_set_pmd() is not an exception.


Some more examples:

if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
...
hpage_collapse_scan_file()
} else {
hpage_collapse_scan_pmd()
...
}


1) hpage_collapse_scan_pmd() is only for anon memory. Totally obvious
from the name. But why are we potentially calling it for VMAs that
are not applicable? For maximum David confusion?

IIRC the VMAs are checked before, what do you mean about "not
applicable"? But anyway khugepaged/MADV_COLLAPSE does release and

I assume when CONFIG_SHMEM=n with ordinary file-thp we'll end up calling it.

reacquire mmap_lock multiple times, so there are multiple places to
check VMAs validity.


hpage_collapse_scan_pmd() should be renamed to something like hpage_collapse_scan_an/on() and the duplicate code in khugepaged_scan_mm_slot() and madvise_collapse() should be factored out into something like:

hpage_collapse_scan(vma, addr, cc)
{
if (vma->vm_file) {
...
hpage_collapse_scan_file()
...
} else if (vma_is_anonymous(vma)) {
hpage_collapse_scan_anon()
} else {
WARN_ON_ONCE();
}
}

Any CONFIG_SHMEM etc. optimizations to compile that code out should go into hpage_collapse_scan_file() IMHO. ... also properly checking for ordinary file THP support.

... and we'd really decide on a terminology "transhuge", "hugepage", "hpage", it's a mess. hpage might be easiest, or simply "thp". We just need a way to distinguish all that stuff from hugetlb.


2) "IS_ENABLED(CONFIG_SHMEM) && vma->vm_file" is also supposed to cover
ordinary file-thp. Totally obvious from the IS_ENABLED(CONFIG_SHMEM)
... I probably spent 30minutes understanding what's happening here.
Just misleading and wrong without CONFIG_SHMEM.


... and what's easier to get than this magic set of boolean flags:

hugepage_vma_check(vma, vma->vm_flags, false, false, true)

This is not perfect. I was thinking about changing them to one flag,
just like TTU_ flags used by try_to_unmap(). That may make things
cleaner.


We should provide similar flags to hugepage_vma_revalidate() and just replace the "cc" parameter by a way to indicate is_khugepaged. Passing in cc is just overkill.

We'd name the functions thp_vma_validate() and thp_vma_revalidate() or sth. like that.


... and obviously
hugepage_vma_revalidate()
is supposed to be a follow up to a previous
hugepage_vma_check()
and totally different from
transhuge_vma_suitable()

Hard to make it even less consistent.

This was after my cleanup, it was much messier before. And I did add
comments to make them more understandable, but anyway better naming is
definitely welcome.

Yeah, I appreciate any previous and any future cleanups in that area.

For example: why even *care* about the complexity of installing a PMD in collapse_pte_mapped_thp() using set_huge_pmd() just for MADV_COLLAPSE?

Sure, we avoid a single page fault afterwards, but is this *really* worth the extra code here? I mean, after we installed the PMD, the page could just get reclaimed either way, so there is no guarantee that we have a PMD mapped once we return to user space IIUC.


Anyhow, don't want to hijack this thread. I was just forced to understand that code and a lot jumped at me :)

--
Thanks,

David / dhildenb