Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio

From: David Hildenbrand
Date: Mon Mar 04 2024 - 16:15:18 EST



Do we need a Fixes tag?

I am not quite sure which commit should be here for a fixes tag.
I think it's more of an optimization.

Good, that helps!




What would be the description of the problem we are fixing?

1) failing to unmap?

That can happen with small folios as well IIUC.

2) Putting the large folio on the deferred split queue?

That sounds more reasonable.

I don't feel it is reasonable. Avoiding this kind of accident splitting
from the kernel's improper code is a more reasonable approach
as there is always a price to pay for splitting and unfolding PTEs
etc.

While we can't avoid splitting coming from userspace's
MADV_DONTNEED, munmap, mprotect, we have a way
to ensure the kernel itself doesn't accidently break up a
large folio.

Note that on the next vmscan we would retry, find the remaining present entries and swapout that thing completely :)


In OPPO's phones, we ran into some weird bugs due to skipped PTEs
in try_to_unmap_one. hardly could we fix it from the root cause. with
various races, figuring out their timings was really a big pain :-)


I can imagine. I assume, though, that it might be related to the way the cont-pte bit was handled. Ryan's implementation should be able to cope with that.

But we did "resolve" those bugs by entirely untouching all PTEs if we
found some PTEs were skipped in try_to_unmap_one [1].

While we find we only get the PTL from 2nd, 3rd but not
1st PTE, we entirely give up on try_to_unmap_one, and leave
all PTEs untouched.

/* we are not starting from head */
if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
ret = false;
atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
set_pte_at(mm, address, pvmw.pte, pteval);
page_vma_mapped_walk_done(&pvmw);
break;
}
This will ensure all PTEs still have a unified state such as CONT-PTE
after try_to_unmap fails.
I feel this could have some false postive because when racing
with unmap, 1st PTE might really become pte_none. So explicitly
holding PTL from 1st PTE seems a better way.

Can we estimate the "cost" of holding the PTL?

--
Cheers,

David / dhildenb