Re: [LSF/MM TOPIC] Non standard size THP

From: Kirill A. Shutemov
Date: Wed Feb 13 2019 - 08:49:07 EST


On Wed, Feb 13, 2019 at 06:20:03PM +0530, Anshuman Khandual wrote:
>
>
> On 02/12/2019 02:03 PM, Kirill A. Shutemov wrote:
> > On Fri, Feb 08, 2019 at 07:43:57AM +0530, Anshuman Khandual wrote:
> >> Hello,
> >>
> >> THP is currently supported for
> >>
> >> - PMD level pages (anon and file)
> >> - PUD level pages (file - DAX file system)
> >>
> >> THP is a single entry mapping at standard page table levels (either PMD or PUD)
> >>
> >> But architectures like ARM64 support non-standard page table level huge pages
> >> with contiguous bits.
> >>
> >> - These are created as multiple entries at either PTE or PMD level
> >> - These multiple entries carry pages which are physically contiguous
> >> - A special PTE bit (PTE_CONT) is set indicating single entry to be contiguous
> >>
> >> These multiple contiguous entries create a huge page size which is different
> >> from the standard PMD/PUD level, but they provide the benefits of huge pages like
> >> fewer faults, bigger TLB coverage, fewer TLB misses etc.
> >>
> >> Currently they are used as HugeTLB pages because
> >>
> >> - HugeTLB page size is carried in the VMA
> >> - Page table walker can operate on multiple PTE or PMD entries given its size in VMA
> >> - Irrespective of HugeTLB page size it is operated with set_huge_pte_at() at any level
> >> - set_huge_pte_at() is arch specific which knows how to encode multiple consecutive entries
> >>
> >> But not as THP huge pages because
> >>
> >> - THP size is not encoded anywhere, such as in the VMA
> >> - Page table walker expects it to be either at PUD (HPAGE_PUD_SIZE) or at PMD (HPAGE_PMD_SIZE)
> >> - Page table operates directly with set_pmd_at() or set_pud_at()
> >> - Direct faulted or promoted huge pages are verified with [pmd|pud]_trans_huge()
> >>
> >> How non-standard huge pages can be supported for THP
> >>
> >> - THP starts recognizing non standard huge page (exported by arch) like HPAGE_CONT_(PMD|PTE)_SIZE
> >> - THP starts operating on either HPAGE_PMD_SIZE or HPAGE_CONT_PMD_SIZE or HPAGE_CONT_PTE_SIZE
> >> - set_pmd_at() only recognizes HPAGE_PMD_SIZE hence replace set_pmd_at() with set_huge_pmd_at()
> >> - set_huge_pmd_at() could differentiate between HPAGE_PMD_SIZE or HPAGE_CONT_PMD_SIZE
> >> - In case of HPAGE_CONT_PTE_SIZE, extend the page table walker down to PTE level
> >> - Use set_huge_pte_at() which can operate on multiple contiguous PTE bits
> >
> > You only listed trivial things. All the tricky stuff is what makes THP
> > transparent.
>
> Agreed. I was trying to draw an analogy from HugeTLB with respect to page
> table creation and its walking. Huge page collapse and split on such non
> standard huge pages will involve taking care of many details.
>
> >
> > To consider it seriously we need to understand what it means for
> > split_huge_p?d()/split_huge_page()? How khugepaged will deal with this?
>
> Absolutely. Can these operate on non-standard, probably multi-entry based
> huge pages? How to handle atomicity etc.?

We need to handle split for them to provide transparency.

> > In particular, I'm worried about exposing (to user or CPU) page table state in
> > the middle of conversion (huge->small or small->huge). Handling this on
> > page table level provides a level of atomicity that you will not have.
>
> I understand it might require a software based lock instead of standard HW
> atomicity constructs which will make it slow but is that even possible ?

I'm not yet sure if it is possible. I haven't wrapped my head around the
idea yet.

> > Honestly, I'm very skeptical about the idea. It took a lot of time to
> > stabilize THP for a single page size, equal to PMD page table, but this looks
> > like a new can of worms. :P
>
> I understand your concern here but HW providing some more TLB sizes beyond
> standard page table level (PMD/PUD/PGD) based huge pages can help achieve
> performance improvement when the buddy is already fragmented enough not to
> provide higher order pages. PUD THP file mapping is already supported for
> DAX and PUD THP anon mapping might be supported in the near future (it is not
> very challenging, other than that allocating a HPAGE_PUD_SIZE huge page at
> runtime will be much more difficult).

That's a bold claim. I would like to look at code. :)

Supporting more than one THP page size at the same time raises a lot more
questions besides the allocation path (although I'm sure compaction will be
happy about this).

For instance, what page size will you allocate for a given fault
address?
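
To make the question concrete, here is a rough user-space sketch of one
possible answer: take the largest size whose naturally aligned range around
the fault address fits inside the VMA. This is only an illustration, not
kernel code and not a proposal. The constants assume arm64 with a 4K granule
(16 contiguous entries per CONT range), and pick_fault_size(), thp_sizes[]
and the SZ_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes for arm64 with a 4K granule; illustrative names only. */
#define SZ_BASE      (4ULL << 10)   /* one PTE */
#define SZ_CONT_PTE  (64ULL << 10)  /* 16 contiguous PTEs */
#define SZ_PMD       (2ULL << 20)   /* one PMD entry */
#define SZ_CONT_PMD  (32ULL << 20)  /* 16 contiguous PMDs */

/* Candidate sizes, largest first. */
static const uint64_t thp_sizes[] = {
	SZ_CONT_PMD, SZ_PMD, SZ_CONT_PTE, SZ_BASE,
};

/*
 * Hypothetical policy: return the largest size whose naturally aligned
 * range around addr lies entirely inside the VMA [vm_start, vm_end).
 * Returns 0 if not even a base page fits.
 */
static uint64_t pick_fault_size(uint64_t vm_start, uint64_t vm_end,
				uint64_t addr)
{
	assert(vm_start <= addr && addr < vm_end);

	for (size_t i = 0; i < sizeof(thp_sizes) / sizeof(thp_sizes[0]); i++) {
		uint64_t size = thp_sizes[i];
		uint64_t start = addr & ~(size - 1);	/* round down to size */

		/* start <= addr < vm_end, so the subtraction cannot wrap. */
		if (start >= vm_start && vm_end - start >= size)
			return size;
	}
	return 0;
}
```

Even this toy policy ignores everything that matters in practice: what is
already mapped in the range, whether the allocation can succeed, and what
happens on a later mprotect() or munmap() of part of it.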

How do you deal with pre-allocated page tables? Depositing 513 page tables
for a given PUD THP page might be fun. :P
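
For reference, the 513 comes from one PMD table plus the 512 PTE tables it
would point to after a full split. A small sketch of the arithmetic, assuming
512 entries per table (x86-64 with 4K pages); tables_to_deposit() and
PTRS_PER_TABLE are made-up names for illustration:

```c
#include <assert.h>

/* Assumption: 512 entries per page table (x86-64 with 4K pages). */
#define PTRS_PER_TABLE 512UL

/*
 * Number of page tables needed to remap a huge mapping with base pages.
 * levels_below is how many page table levels sit under the huge entry:
 * 1 for a PMD mapping (a single PTE table), 2 for a PUD mapping
 * (one PMD table plus the PTE tables it points to).
 */
static unsigned long tables_to_deposit(unsigned int levels_below)
{
	unsigned long total = 0, tables_at_level = 1;

	assert(levels_below >= 1);
	for (unsigned int i = 0; i < levels_below; i++) {
		total += tables_at_level;
		tables_at_level *= PTRS_PER_TABLE;
	}
	return total;
}
```

With these assumptions tables_to_deposit(1) is 1, which matches the single
page table deposited for a PMD THP today, and tables_to_deposit(2) is
1 + 512 = 513.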

> Sizes around PMD, like HPAGE_CONT_PMD_SIZE or
> HPAGE_CONT_PTE_SIZE, really have a better chance as a future non-PMD level anon
> mapping than PUD size anon mapping support in THP.
>
> >
> > It *might* be possible to support it for DAX, but beyond that...
> >
>
> Did not get that. Why would you think that this is possible or appropriate
> only for DAX file mapping but not for anon mapping ?

DAX THP is inherently simpler: no struct pages -- less state to track and
no need for split_huge_page(); split_huge_p?d() can be handled by dropping
the entities in question and re-faulting them as smaller entries. No problem
with compaction...

--
Kirill A. Shutemov