Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

From: David Hildenbrand
Date: Mon Jul 17 2023 - 09:57:48 EST


On 17.07.23 15:20, Ryan Roberts wrote:
On 17/07/2023 14:06, David Hildenbrand wrote:
On 14.07.23 19:17, Yu Zhao wrote:
On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:

Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
allocated in large folios of a determined order. All pages of the large
folio are pte-mapped during the same page fault, significantly reducing
the number of page faults. The number of per-page operations (e.g. ref
counting, rmap management lru list management) are also significantly
reduced since those ops now become per-folio.

The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
defaults to disabled for now; The long term aim is for this to defaut to
enabled, but there are some risks around internal fragmentation that
need to be better understood first.

When enabled, the folio order is determined as such: For a vma, process
or system that has explicitly disabled THP, we continue to allocate
order-0. THP is most likely disabled to avoid any possible internal
fragmentation so we honour that request.

Otherwise, the return value of arch_wants_pte_order() is used. For vmas
that have not explicitly opted-in to use transparent hugepages (e.g.
where thp=madvise and the vma does not have MADV_HUGEPAGE), then
arch_wants_pte_order() is limited by the new cmdline parameter,
`flexthp_unhinted_max`. This allows for a performance boost without
requiring any explicit opt-in from the workload while allowing the
sysadmin to tune between performance and internal fragmentation.

arch_wants_pte_order() can be overridden by the architecture if desired.
Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
set of ptes map physically contigious, naturally aligned memory, so this
mechanism allows the architecture to optimize as required.

If the preferred order can't be used (e.g. because the folio would
breach the bounds of the vma, or because ptes in the region are already
mapped) then we fall back to a suitable lower order; first
PAGE_ALLOC_COSTLY_ORDER, then order-0.

Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
---
  .../admin-guide/kernel-parameters.txt         |  10 +
  mm/Kconfig                                    |  10 +
  mm/memory.c                                   | 187 ++++++++++++++++--
  3 files changed, 190 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt
b/Documentation/admin-guide/kernel-parameters.txt
index a1457995fd41..405d624e2191 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1497,6 +1497,16 @@
                         See Documentation/admin-guide/sysctl/net.rst for
                         fb_tunnels_only_for_init_ns

+       flexthp_unhinted_max=
+                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
+                       folio size that will be allocated for an anonymous vma
+                       that has neither explicitly opted in nor out of using
+                       transparent hugepages. The size must be a power-of-2 in
+                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
+                       performance by reducing page faults, while a smaller
+                       size reduces internal fragmentation. Default: max(64K,
+                       PAGE_SIZE). Format: size[KMG].
+

Let's split this parameter into a separate patch.


Just a general comment after stumbling over patch #2, let's not start splitting
patches into things that don't make any sense on their own; that just makes
review a lot harder.

ACK


For this case here, I'd suggest first adding the general infrastructure and then
adding tunables we want to have on top.

OK, so 1 patch for the main infrastructure, then a patch to disable for
MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
via a sysctl?

MADV_NOHUGEPAGE handling for me falls under the category "required for correctness to not break existing workloads" and has to be there initially.

Anything that is rather a performance tunable (e.g., a sysctl to optimize) can be added on top and discussed separately.

At least IMHO :)



I agree that toggling that at runtime (for example via sysfs as raised by me
previously) would be nicer.

OK, I clearly misunderstood, I thought you were requesting a boot parameter.

Oh, sorry about that. I wanted to actually express "/sys/kernel/mm/transparent_hugepage/" sysctls where we can toggle that later at runtime as well.

What's the ABI compat guarrantee for sysctls? I assumed that for a boot
parameter it would be easier to remove in future if we wanted, but for sysctl,
its there forever?

sysctl are hard/impossible to remove, yes. So we better make sure what we add has clear semantics.

If we ever want some real auto-tunable mode (and can actually implement it without harming performance; and I am skeptical), we might want to allow for setting such a parameter to "auto", for example.


Also, how do you feel about the naming and behavior of the parameter?

Very good question. "flexthp_unhinted_max" naming is a bit suboptimal.

For example, I'm not so sure if we should expose the feature to user space as "flexthp" at all. I think we should find a clearer feature name to begin with.

... maybe we can initially get away with dropping that parameter and default to something reasonably small (i.e., 64k as you have above)?

/sys/kernel/mm/transparent_hugepage/enabled=never and simply not get any thp.

--
Cheers,

David / dhildenb