Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

From: David Hildenbrand
Date: Wed Feb 01 2023 - 10:58:29 EST


On 01.02.23 16:45, James Houghton wrote:
On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@xxxxxxxxxx> wrote:

On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@xxxxxxxxxx> wrote:

On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@xxxxxxxxxx> wrote:

On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
[snip]
Another way to not use thp mapcount, nor break smaps and similar calls to
page_mapcount() on small page, is to only increase the hpage mapcount only
when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
as leaf or a non-leaf), and the mapcount can be decreased when the pXd
entry is removed (for leaf, it's the same as for now; for HGM, it's when
freeing pgtable of the PUD entry).

Right, and this is doable. Also it seems like this is pretty close to
the direction Matthew Wilcox wants to go with THPs.

I may not be familiar with it, do you mean this one?

https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@xxxxxxxxxxxxxxxxxxxx/

Yep that's it.


For hugetlb I think it should be easier to maintain rather than any-sized
folios, because there's the pgtable non-leaf entry to track rmap
information and the folio size being static to hpage size.

It'll be different to folios where it can be random sized pages chunk, so
it needs to be managed by batching the ptes when install/zap.

Agreed. It's probably easier for HugeTLB because they're always
"naturally aligned" and yeah they can't change sizes.



Something I noticed though, from the implementation of
folio_referenced()/folio_referenced_one(), is that folio_mapcount()
ought to report the total number of PTEs that are pointing on the page
(or the number of times page_vma_mapped_walk returns true). FWIW,
folio_referenced() is never called for hugetlb folios.

FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
it'll walk every leaf page being mapped, big or small, so IIUC that number
should match with what it expects to see later, more or less.

I don't fully understand what you mean here.

I meant the rmap_walk pairing with folio_referenced_one() will walk all the
leaves for the folio, big or small. I think that will match the number
with what got returned from folio_mapcount().

See below.




But I agree the mapcount/referenced value itself is debatable to me, just
like what you raised in the other thread on page migration. Meanwhile, I
am not certain whether the mapcount is accurate either because AFAICT the
mapcount can be modified if e.g. new page mapping established as long as
before taking the page lock later in folio_referenced().

It's just that I don't see any severe issue either due to any of above, as
long as that information is only used as a hint for next steps, e.g., to
swap which page out.

I also don't see a big problem with folio_referenced() (and you're
right that folio_mapcount() can be stale by the time it takes the
folio lock). It still seems like folio_mapcount() should return the
total number of PTEs that map the page though. Are you saying that
breaking this would be ok?

I didn't quite follow - isn't that already doing so?

folio_mapcount() is total_compound_mapcount() here, IIUC it is an
accumulated value of all possible PTEs or PMDs being mapped as long as it's
all or part of the folio being mapped.

We've talked about 3 ways of handling mapcount:

1. The RFC v2 way, which is head-only, and we increment the compound
mapcount for each PT mapping we have. So a PTE-mapped 2M page,
compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
2. The THP-like way. If we are fully mapping the hugetlb page with the
hstate-level PTE, we increment the compound mapcount, otherwise we
increment subpage->_mapcount.
3. The RFC v1 way (the way you have suggested above), which is
head-only, and we increment the compound mapcount if the hstate-level
PTE is made present.

With #1 and #2, there is no concern with folio_mapcount(). But with
#3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
would yield 1 instead of 512 (right?). That's what I mean.

My 2 cents:

The mapcount is primarily used (in hugetlb context) to

(a) Detect if a page might be shared. mapcount > 1 implies that two independent page table hierarchies are mapping the page. We care about mapcount == 1 vs. mapcount != 1.

(b) Detect if unmapping was sucessfull. We care about mapcount == 0 vs. mapcount != 0.

For hugetlb, I don't see why we should care about the subpage mapcount at all.

For (a) it's even good to count "somehow mapped into a single page table structure" as "mapcount == 1" For (b), we don't care as long as "still mapped" implies "mapcount != 0".

--
Thanks,

David / dhildenb