Re: [PATCH v2 0/1] change ->index to PAGE_SIZE for hugetlb pages

From: Matthew Wilcox
Date: Sat Jul 22 2023 - 00:18:40 EST


On Wed, Jul 19, 2023 at 05:00:11PM -0700, Mike Kravetz wrote:
> On 07/10/23 16:04, Sidhartha Kumar wrote:
> > ========================== OVERVIEW ========================================
> > This patchset attempts to implement a listed filemap TODO which is
> > changing hugetlb folios to have ->index in PAGE_SIZE. This simplifies many
> > functions within filemap.c as they have to special case hugetlb pages.
> > From the RFC v1[1], Mike pointed out that hugetlb will still have to maintain
> > a huge page sized index as it is used for the reservation map and the hash
> > function for the hugetlb mutex table.
> >
> > This patchset adds new wrappers for hugetlb code to interact with the
> > page cache. These wrappers calculate a linear page index as this is now
> > what the page cache expects for hugetlb pages.
> >
> > From the discussion on HGM for hugetlb[3], there is a want to remove hugetlb
> > special casing throughout the core mm code. This series accomplishes
> > a part of this by shifting complexity from filemap.c to hugetlb.c. There
> > are still checks for hugetlb within the filemap code as cgroup accounting
> > and hugetlb accounting are special cased as well.
> >
> > =========================== PERFORMANCE =====================================
>
> Hi Sid,
>
> Sorry for being dense, but can you tell me what the below performance
> information means? My concern with such a change would be any noticeable
> difference in populating a large (up to TB) hugetlb file. My guess is
> that it is going to take longer unless xarray is optimized for this.
>
> We do have users that create and pre-populate hugetlb files this big.
> Just want to make sure there are no surprises for them.

It's Going To Depend. Annoyingly.

Let's say you're using 1GB pages on a 4kB PAGE_SIZE machine. That's an
order-18 folio, so we end up skipping three layers of the tree, and if
you're going up to 1TB, it's structured:

root -> node (shift 30) -> node (shift 24) -> entry
                                           -> entry (...)
                        -> node (shift 24) -> entry
                                              (...)
                           (...)

This is essentially no different from before where each 1GB page would
occupy a single entry. It's just that it now occupies 2^18 entries,
and everything in the tree has a different label.
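
To make the order arithmetic above concrete, here is a rough Python sketch
(illustration only, not kernel code). It assumes 64-slot nodes, i.e. 6 index
bits per tree level, and a 4kB PAGE_SIZE; the helper names are made up for
this example.

```python
# Illustrative arithmetic only; not kernel code.
CHUNK_SHIFT = 6   # assumed: 64 slots per tree node, 6 index bits per level
PAGE_SHIFT = 12   # 4kB PAGE_SIZE

def log2_pages(size_bytes):
    """log2 of the number of base pages that fit in size_bytes."""
    pages = size_bytes >> PAGE_SHIFT
    # Only power-of-two page counts make sense for this calculation.
    assert pages > 0 and pages & (pages - 1) == 0
    return pages.bit_length() - 1

def levels_skipped(order):
    """Whole tree levels that an entry of this order makes unnecessary."""
    return order // CHUNK_SHIFT

order_1g = log2_pages(1 << 30)        # order of a 1GB folio
skipped_1g = levels_skipped(order_1g)
indices_per_1g = 1 << order_1g        # PAGE_SIZE indices covered by one folio
index_bits_1t = log2_pages(1 << 40)   # index bits needed for a 1TB file

print(order_1g, skipped_1g, indices_per_1g, index_bits_1t)
```

If the node labels in the diagram are read as the log2 of the pages a node
spans, a 28-bit index is why the top node is the shift-30 one: a shift-24
node would only cover 64GB.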

Where you will (may?) see a difference is with the 2MB entries.
An order-9 page doesn't quite fit with the order-6 nodes in the tree,
so it looks like this:

root -> node (s30) -> node (s24) -> node (s18) -> node (s12) -> entry 0
                                                             -> sibling
                                                             -> sibling
                                                                (...)
                                                             -> entry 8
                                                             -> sibling
                                                             -> sibling
                                                                (...)

so all of a sudden the tree is 8x as big as it used to be. The upside
is that we lose all the calculations from filemap.c/pagemap.h. It's a
lot better than it was perhaps five years ago when each 2MB page would
occupy 512 entries, but 8 entries is still worse than 1.
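
The "8 entries" figure falls out of the order modulo the node's index bits.
A small Python sketch of that calculation follows (illustration only, not
kernel code; it assumes 64-slot nodes, and the function names are
hypothetical):

```python
# Illustrative arithmetic only; not kernel code.
CHUNK_SHIFT = 6   # assumed: 64 slots per tree node

def slots_per_entry(order):
    """Slots one entry of this order fills in the node that holds it
    (one canonical slot plus its siblings)."""
    return 1 << (order % CHUNK_SHIFT)

def canonical_slots(order):
    """Slot offsets within one 64-slot node at which each entry starts."""
    step = slots_per_entry(order)
    return list(range(0, 1 << CHUNK_SHIFT, step))

slots_2m = slots_per_entry(9)    # order-9: 2MB on 4kB pages
slots_1g = slots_per_entry(18)   # order-18: 1GB on 4kB pages
starts_2m = canonical_slots(9)

print(slots_2m, slots_1g, starts_2m[:3], len(starts_2m))
```

Under these assumptions an order-9 entry starts at slots 0, 8, 16 and so on,
which is the "entry 0 / entry 8" layout in the diagram, and one node holds
only 8 such entries instead of 64.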

Could we do better? Undoubtedly. We could have variable shifts & node
sizes in the tree so that we perhaps had an s18 node that was 8x as large
(4160 bytes), and then each order-9 entry in the tree would occupy one
entry in that special large node. I've been reluctant to introduce such
a beast without strong evidence it would help. Or we could introduce a
small s12 node which could only store 8 entries (again an order-9 entry
would occupy one entry in such a special node).
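
For a feel of the sizes involved, here is a back-of-the-envelope Python
sketch (not kernel code). It assumes 8-byte slot pointers and a fixed
64-byte node header; the header size is only inferred from the 4160-byte
figure above (4160 = 512 * 8 + 64), and any per-slot bookkeeping that would
also grow with the slot count is ignored.

```python
# Back-of-the-envelope sizes only; not kernel code.
POINTER_BYTES = 8   # assumed: 64-bit machine
HEADER_BYTES = 64   # assumed: inferred from 4160 = 512 * 8 + 64

def node_bytes(slots):
    """Approximate size of a tree node with the given number of slots."""
    return HEADER_BYTES + slots * POINTER_BYTES

regular = node_bytes(64)    # today's 64-slot node
large = node_bytes(512)     # hypothetical 8x node holding 512 order-9 entries
small = node_bytes(8)       # hypothetical small node holding 8 order-9 entries

print(regular, large, small)
```

So the large variant costs a little over a page per node, while the small
variant would be a fraction of a regular node, under the assumptions stated.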

These are things which would only benefit hugetlbfs, so there's a bit
of a chicken-and-egg problem; no demand for the feature until the work
is done, and the work maybe performs badly until the feature exists.

And then some architectures have other orders for their huge pages.
Order 11 is probably the worst case that could exist (or in general 6n -
1), since each entry would then occupy 32 slots of a node, but I haven't
done a detailed survey to figure out if anyone supports such a thing.
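
A quick way to see why orders of the form 6n - 1 are the worst is to
tabulate the slot cost per entry across a range of orders. This is a rough
Python sketch, not kernel code; it assumes 64-slot nodes, and the range of
orders is arbitrary.

```python
# Illustrative arithmetic only; not kernel code.
CHUNK_SHIFT = 6   # assumed: 64 slots per tree node

def slots_per_entry(order):
    """Slots one entry of this order fills in the node that holds it."""
    return 1 << (order % CHUNK_SHIFT)

# Arbitrary illustrative range: order 0 up to order 18 (1GB on 4kB pages).
costs = {order: slots_per_entry(order) for order in range(0, 19)}
worst = max(costs.values())
worst_orders = [order for order, cost in costs.items() if cost == worst]

print(worst, worst_orders)
```

Orders that are a multiple of 6 cost a single slot, and the cost doubles
with each step up until the next multiple of 6.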