Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

From: Ryan Roberts
Date: Thu Nov 23 2023 - 14:11:34 EST


On 23/11/2023 17:22, Peter Xu wrote:
> On Thu, Nov 23, 2023 at 03:47:49PM +0000, Matthew Wilcox wrote:
>> It looks like ARM (in the person of Ryan) are going to add support for
>> something equivalent to hugepd.
>
> If it's about arm's cont_pte, then it looks ideal because this series
> didn't yet touch cont_pte, assuming it'll just work. From that aspect, his
> work may help mine, and no immediately collapsing either.

Hi,

I'm not sure I've 100% understood the crossover between this series and my work
to support arm64's contpte mappings generally for anonymous and file-backed memory.

My approach is to transparently use contpte mappings when the core-mm requests
pte mappings that meet the requirements; it's all based around intercepting the
normal (non-hugetlb) helpers (e.g. set_ptes(), ptep_get() and friends). There is
no semantic change to the core-mm. See [1]. It relies on 1) the page cache using
large folios and 2) my "small-sized THP" series, which starts using arbitrarily
sized large folios for anonymous memory [2].

If I've understood this conversation correctly, there is an object called hugepd,
which today is only supported by powerpc but which could allow the core-mm to
control the mapping granularity. I can see some value in exposing that control
to the core-mm in the (very) long term.

[1] https://lore.kernel.org/all/20231115163018.1303287-1-ryan.roberts@xxxxxxx/
[2] https://lore.kernel.org/linux-mm/20231115132734.931023-1-ryan.roberts@xxxxxxx/

Thanks,
Ryan


>
> There can be a slight performance difference which I need to measure for
> arm's cont_pte already for hugetlb, but I didn't worry much on that;
> quoting my commit message in the last patch:
>
> There may be a slight difference of how the loops run when processing
> GUP over a large hugetlb range on either ARM64 (e.g. CONT_PMD) or RISCV
> (mostly its Svnapot extension on 64K huge pages): each loop of
> __get_user_pages() will resolve one pgtable entry with the patch
> applied, rather than relying on the size of hugetlb hstate, the latter
> may cover multiple entries in one loop.
>
> However, the performance difference should hopefully not be a major
> concern, considering that GUP just yet got 57edfcfd3419 ("mm/gup:
> accelerate thp gup even for "pages != NULL""), and that's not part of a
> performance analysis but a side dish. If the performance will be a
> concern, we can consider handling CONT_PTE in follow_page(), for example.
>
> So IMHO it can be slightly different comparing to e.g. page fault, because
> each fault is still pretty slow as a whole if one fault for each small pte
> (of a large folio / cont_pte), while the loop in GUP is still relatively
> tight and short, comparing to a fault. I'd boldly guess more low hanging
> fruits out there for large folio outside GUP areas.
>
> In all cases, it'll be interesting to know if Ryan has worked on cont_pte
> support for gup on large folios, and whether there's any performance number
> to share. It's definitely good news to me because it means Ryan's work can
> also then benefit hugetlb if this series will be merged, I just don't know
> how much difference there will be.
>
> Thanks,
>