Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

From: John Hubbard
Date: Mon Nov 13 2023 - 09:52:41 EST


On 11/13/23 2:19 AM, Ryan Roberts wrote:
> On 13/11/2023 05:18, Matthew Wilcox wrote:
>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>> I've done some initial performance testing of this patchset on an arm64
>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>> patches in Ryan's git tree (he has conveniently combined everything
>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>> some memory-intensive workloads. Many test runs, conducted independently
>>> by different engineers and on different machines, have convinced me and
>>> my colleagues that this is an accurate result.
>>>
>>> In order to achieve that result, we used the git tree in [1] with the
>>> following settings:
>>>
>>> echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>> echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>
>>> This was on an aarch64 machine configured to use a 64KB base page size.
>>> That configuration means that the PMD size is 512MB, which is of course
>>> too large for practical use as a pure PMD-THP. However, with these
>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>> coverage, while still getting pages that are small enough to be
>>> effectively usable.
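
(As an aside for anyone reproducing this: the knobs can be read back from
the same sysfs files to confirm they took effect - a minimal sketch, and
the exact output format depends on this patchset's interface:)

cat /sys/kernel/mm/transparent_hugepage/enabled      # selected policy is shown in brackets
cat /sys/kernel/mm/transparent_hugepage/anon_orders  # bitmask of enabled anon THP orders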

>> That is quite remarkable!

> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!


>> My hope is to abolish the 64kB page size configuration. ie instead of

We've found that a 64KB base page size provides better performance for
HPC and AI workloads than a 4KB base size does, at least for these kinds of
servers. In fact, the 4KB config is considered odd and I'd have to
look around to get one. It's mostly a TLB coverage issue because,
again, the problem typically has a very large memory footprint.

So even though it would be nice from a software point of view, there's
a real need for this.

>> using the mixture of page sizes that you currently are -- 64k and
>> 1M (right? Order-0, and order-4)

> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
> intuitively you would expect the order to remain constant, but it doesn't).
>
> The "recommend" setting above will actually enable order-3 as well even though
> there is no HW benefit to this. So the full set of available memory sizes here is:
>
> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
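
(To spell out why the order changes with the base page size: the block size
is the number of PTEs covered by the contiguous bit times the base page
size. A sketch of that arithmetic, assuming the usual arm64 groupings of
16/128/32 contiguous PTEs for 4K/16K/64K granules - those counts are my
assumption, not stated above:)

echo $(( 16  * 4  ))K   # 4K  base: 16 contiguous PTEs  ->   64K block, order-4
echo $(( 128 * 16 ))K   # 16K base: 128 contiguous PTEs -> 2048K (2M) block, order-7
echo $(( 32  * 64 ))K   # 64K base: 32 contiguous PTEs  -> 2048K (2M) block, order-5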

Yes, and to provide some further details about the test runs, I went so
far as to test individual anon_orders (for example, anon_orders=0x20), in
order to isolate behavior and see what's really going on.

On this hardware, anything with 2MB page sizes (which corresponds to
anon_orders=0x20, as I recall) or larger gets the 10x boost. It's
an interesting on/off behavior. This particular server design and
workload combination really prefers 2MB pages, even if they are
held together with contpte instead of a real PMD entry.
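
A minimal sketch of how such a single-order configuration can be set up
(each enabled order is one bit in anon_orders, and order-5 is 2M with 64K
base pages, hence 0x20; the benchmark invocation itself is omitted):

echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo 0x20 >/sys/kernel/mm/transparent_hugepage/anon_orders
cat /sys/kernel/mm/transparent_hugepage/anon_orders   # confirm only bit 5 (order-5) is set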


>> , that 4k, 64k and 2MB (order-0,
>> order-4 and order-9) will provide better performance.

>> Have you run any experiments with a 4kB page size?

> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
> to get to a world where we universally deal in variable sized chunks of memory,
> aligned on 4K boundaries.
>
> In my experience though, there are still some performance benefits to a 64K
> base page vs 4K+contpte; the page tables are more cache efficient for the
> former case - 64K of memory is described by 8 bytes in the former vs
> 8x16=128 bytes in the latter. In practice the HW will still only read 8
> bytes in the latter, but that's taking up a full cache line, vs the former
> where a single cache line stores 8x 64K entries.
>
> Thanks,
> Ryan
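
(Just to spell out that cache-line arithmetic, assuming 8-byte PTEs and
64-byte cache lines - a sketch only:)

echo $(( (64 / 4) * 8 ))     # PTE bytes describing 64K of memory with 4K pages: 128
echo $(( 1 * 8 ))            # PTE bytes describing 64K of memory with 64K pages: 8
echo $(( (64 / 8) * 64 ))K   # memory covered by one 64-byte line of 64K-page PTEs: 512K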


thanks,

--
John Hubbard
NVIDIA