Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio

From: Barry Song
Date: Mon Mar 04 2024 - 16:05:18 EST


On Tue, Mar 5, 2024 at 1:41 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 04.03.24 13:20, Ryan Roberts wrote:
> > Hi Barry,
> >
> > On 04/03/2024 10:37, Barry Song wrote:
> >> From: Barry Song <v-songbaohua@xxxxxxxx>
> >>
> >> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >> PTEs modification such as break-before-make, while iterating PTEs
> >> of a large folio, it will only begin to acquire PTL after it gets
> >> a valid(present) PTE. break-before-make intermediately sets PTEs
> >> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >> in try_to_unmap_one().
> >
> > I just want to check my understanding here - I think the problem occurs for
> > PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> > that I've had a look at the code and have a better understanding, I think that
> > must be the case? And therefore this problem exists independently of my work to
> > support swap-out of mTHP? (From your previous report I was under the impression
> > that it only affected mTHP).
> >
> > Its just that the problem is becoming more pronounced because with mTHP,
> > PTE-mapped large folios are much more common?
>
> That is my understanding.
>
> >
> >> For example, for an anon folio, after try_to_unmap_one(), we may
> >> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> >> So folio will be still mapped, the folio fails to be reclaimed.
> >> What’s even more worrying is, its PTEs are no longer in a unified
> >> state. This might lead to accident folio_split() afterwards. And
> >> since a part of PTEs are now swap entries, accessing them will
> >> incur page fault - do_swap_page.
> >> It creates both anxiety and more expense. While we can't avoid
> >> userspace's unmap to break up unified PTEs such as CONT-PTE for
> >> a large folio, we can indeed keep away from kernel's breaking up
> >> them due to its code design.
> >> This patch is holding PTL from PTE0, thus, the folio will either
> >> be entirely reclaimed or entirely kept. On the other hand, this
> >> approach doesn't increase PTL contention. Even w/o the patch,
> >> page_vma_mapped_walk() will always get PTL after it sometimes
> >> skips one or two PTEs because intermediate break-before-makes
> >> are short, according to test. Of course, even w/o this patch,
> >> the vast majority of try_to_unmap_one still can get PTL from
> >> PTE0. This patch makes the number 100%.
> >> The other option is that we can give up in try_to_unmap_one
> >> once we find PTE0 is not the first entry we get PTL, we call
> >> page_vma_mapped_walk_done() to end the iteration at this case.
> >> This will keep the unified PTEs while the folio isn't reclaimed.
> >> The result is quite similar with small folios with one PTE -
> >> either entirely reclaimed or entirely kept.
> >> Reclaiming large folios by holding PTL from PTE0 seems a better
> >> option comparing to giving up after detecting PTL begins from
> >> non-PTE0.
> >>
>
> I'm sure that wall of text can be formatted in a better way :) . Also, I
> think we can drop some of the details,
>
> If you need some inspiration, I can give it a shot.
>
> >> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> >> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> >
> > Do we need a Fixes tag?

I am not quite sure which commit should be here for a fixes tag.
I think it's more of an optimization.

> >
>
> What would be the description of the problem we are fixing?
>
> 1) failing to unmap?
>
> That can happen with small folios as well IIUC.
>
> 2) Putting the large folio on the deferred split queue?
>
> That sounds more reasonable.

I don't feel it is reasonable. Avoiding this kind of accident splitting
from the kernel's improper code is a more reasonable approach
as there is always a price to pay for splitting and unfolding PTEs
etc.

While we can't avoid splitting coming from userspace's
MADV_DONTNEED, munmap, mprotect, we have a way
to ensure the kernel itself doesn't accidently break up a
large folio.

In OPPO's phones, we ran into some weird bugs due to skipped PTEs
in try_to_unmap_one. hardly could we fix it from the root cause. with
various races, figuring out their timings was really a big pain :-)

But we did "resolve" those bugs by entirely untouching all PTEs if we
found some PTEs were skipped in try_to_unmap_one [1].

While we find we only get the PTL from 2nd, 3rd but not
1st PTE, we entirely give up on try_to_unmap_one, and leave
all PTEs untouched.

/* we are not starting from head */
if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
ret = false;
atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
set_pte_at(mm, address, pvmw.pte, pteval);
page_vma_mapped_walk_done(&pvmw);
break;
}
This will ensure all PTEs still have a unified state such as CONT-PTE
after try_to_unmap fails.
I feel this could have some false postive because when racing
with unmap, 1st PTE might really become pte_none. So explicitly
holding PTL from 1st PTE seems a better way.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c#L1730

>
> >> ---
> >> mm/vmscan.c | 11 +++++++++++
> >> 1 file changed, 11 insertions(+)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 0b888a2afa58..e4722fbbcd0c 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>
> >> if (folio_test_pmd_mappable(folio))
> >> flags |= TTU_SPLIT_HUGE_PMD;
> >> + /*
> >> + * if page table lock is not held from the first PTE of
> >> + * a large folio, some PTEs might be skipped because of
> >> + * races with break-before-make, for example, PTEs can
> >> + * be pte_none intermediately, thus one or more PTEs
> >> + * might be skipped in try_to_unmap_one, we might result
> >> + * in a large folio is partially mapped and partially
> >> + * unmapped after try_to_unmap
> >> + */
> >> + if (folio_test_large(folio))
> >> + flags |= TTU_SYNC;
> >
> > This looks sensible to me after thinking about it for a while. But I also have a
> > gut feeling that there might be some more subtleties that are going over my
> > head, since I'm not expert in this area. So will leave others to provide R-b :)
> >
>
> As we are seeing more such problems with lockless PT walks, maybe we
> really want some other special value (nonswap entry?) to indicate that a
> PTE this is currently ondergoing protection changes. So we'd avoid the
> pte_none() temporarily, if possible.
>
> Without that, TTU_SYNC feels like the right thing to do.
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry