Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

From: Barry Song
Date: Mon Nov 27 2023 - 05:00:18 EST


On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 27/11/2023 08:42, Barry Song wrote:
> >>> + for (i = 0; i < nr; i++, page++) {
> >>> + if (anon) {
> >>> + /*
> >>> + * If this page may have been pinned by the
> >>> + * parent process, copy the page immediately for
> >>> + * the child so that we'll always guarantee the
> >>> + * pinned page won't be randomly replaced in the
> >>> + * future.
> >>> + */
> >>> + if (unlikely(page_try_dup_anon_rmap(
> >>> + page, false, src_vma))) {
> >>> + if (i != 0)
> >>> + break;
> >>> + /* Page may be pinned, we have to copy. */
> >>> + return copy_present_page(
> >>> + dst_vma, src_vma, dst_pte,
> >>> + src_pte, addr, rss, prealloc,
> >>> + page);
> >>> + }
> >>> + rss[MM_ANONPAGES]++;
> >>> + VM_BUG_ON(PageAnonExclusive(page));
> >>> + } else {
> >>> + page_dup_file_rmap(page, false);
> >>> + rss[mm_counter_file(page)]++;
> >>> + }
> >>> }
> >>> - rss[MM_ANONPAGES]++;
> >>> - } else if (page) {
> >>> - folio_get(folio);
> >>> - page_dup_file_rmap(page, false);
> >>> - rss[mm_counter_file(page)]++;
> >>> +
> >>> + nr = i;
> >>> + folio_ref_add(folio, nr);
> >>
> >> You're changing the order of mapcount vs. refcount increment. Don't.
> >> Make sure your refcount >= mapcount.
> >>
> >> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >> then decrementing in case of error accordingly. Errors due to pinned
> >> pages are the corner case.
> >>
> >> I'll note that it will make a lot of sense to have batch variants of
> >> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>
> >
> > i still don't understand why it is not a entire map+1, but an increment
> > in each basepage.
>
> Because we are PTE-mapping the folio, we have to account each individual page.
> If we accounted the entire folio, where would we unaccount it? Each page can be
> unmapped individually (e.g. munmap() part of the folio) so need to account each
> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> atomic, so we can account the entire thing.

Hi Ryan,

There is no problem. for example, a large folio is entirely mapped in
process A with CONPTE,
and only page2 is mapped in process B.
then we will have

entire_map = 0
page0.map = -1
page1.map = -1
page2.map = 0
page3.map = -1
....

>
> >
> > as long as it is a CONTPTE large folio, there is no much difference with
> > PMD-mapped large folio. it has all the chance to be DoubleMap and need
> > split.
> >
> > When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> > similar things on a part of the large folio in process A,
> >
> > this large folio will have partially mapped subpage in A (all CONTPE bits
> > in all subpages need to be removed though we only unmap a part of the
> > large folioas HW requires consistent CONTPTEs); and it has entire map in
> > process B(all PTEs are still CONPTES in process B).
> >
> > isn't it more sensible for this large folios to have entire_map = 0(for
> > process B), and subpages which are still mapped in process A has map_count
> > =0? (start from -1).
> >
> >> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >> check once if the folio maybe pinned, and in that case, you can simply
> >> drop all references again. So you either have all or no ptes to process,
> >> which makes that code easier.
>
> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> fundamentally you can only use entire_mapcount if its only possible to map and
> unmap the whole folio atomically.



My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
it is partially
mapped. if a large folio is mapped in one processes with all CONTPTEs
and meanwhile in another process with partial mapping(w/o CONTPTE), it is
DoubleMapped.

Since we always hold ptl to set or drop CONTPTE bits, set/drop is
still atomic in a
spinlock area.

>
> >>
> >> But that can be added on top, and I'll happily do that.
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >

Thanks
Barry