Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio

From: Barry Song
Date: Mon Mar 04 2024 - 17:29:54 EST


On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
>
> >>> Do we need a Fixes tag?
> >
> > I am not quite sure which commit should be here for a fixes tag.
> > I think it's more of an optimization.
>
> Good, that helps!
>
> >
> >>>
> >>
> >> What would be the description of the problem we are fixing?
> >>
> >> 1) failing to unmap?
> >>
> >> That can happen with small folios as well IIUC.
> >>
> >> 2) Putting the large folio on the deferred split queue?
> >>
> >> That sounds more reasonable.
> >
> > I don't feel it is reasonable. Avoiding this kind of accident splitting
> > from the kernel's improper code is a more reasonable approach
> > as there is always a price to pay for splitting and unfolding PTEs
> > etc.
> >
> > While we can't avoid splitting coming from userspace's
> > MADV_DONTNEED, munmap, mprotect, we have a way
> > to ensure the kernel itself doesn't accidently break up a
> > large folio.
>
> Note that on the next vmscan we would retry, find the remaining present
> entries and swapout that thing completely :)

This is true, but since we can finish the job the first time, it seems
second retry is a cost :-)

>
> >
> > In OPPO's phones, we ran into some weird bugs due to skipped PTEs
> > in try_to_unmap_one. hardly could we fix it from the root cause. with
> > various races, figuring out their timings was really a big pain :-)
> >
>
> I can imagine. I assume, though, that it might be related to the way the
> cont-pte bit was handled. Ryan's implementation should be able to cope
> with that.

I guess you are probably right. Ryan's implementation decouples CONT-PTE
from mm core. nice to have it.

>
> > But we did "resolve" those bugs by entirely untouching all PTEs if we
> > found some PTEs were skipped in try_to_unmap_one [1].
> >
> > While we find we only get the PTL from 2nd, 3rd but not
> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
> > all PTEs untouched.
> >
> > /* we are not starting from head */
> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> > ret = false;
> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> > set_pte_at(mm, address, pvmw.pte, pteval);
> > page_vma_mapped_walk_done(&pvmw);
> > break;
> > }
> > This will ensure all PTEs still have a unified state such as CONT-PTE
> > after try_to_unmap fails.
> > I feel this could have some false postive because when racing
> > with unmap, 1st PTE might really become pte_none. So explicitly
> > holding PTL from 1st PTE seems a better way.
>
> Can we estimate the "cost" of holding the PTL?
>

This is just moving PTL acquisition one or two PTE earlier in those corner
cases. In normal cases, it doesn't affect when PTL is held.

In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold
PTL immediately. in corner cases, page_vma_mapped_walk races with break-
before-make, after skipping one or two PTEs whose states are transferring,
it will find a present pte then acquire lock.

> --
> Cheers,
>
> David / dhildenb

Thanks
Barry