Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

From: Ryan Roberts
Date: Mon Jul 17 2023 - 10:48:16 EST


On 17/07/2023 14:56, David Hildenbrand wrote:
> On 17.07.23 15:20, Ryan Roberts wrote:
>> On 17/07/2023 14:06, David Hildenbrand wrote:
>>> On 14.07.23 19:17, Yu Zhao wrote:
>>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>>
>>>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>>>> allocated in large folios of a determined order. All pages of the large
>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>> counting, rmap management lru list management) are also significantly
>>>>> reduced since those ops now become per-folio.
>>>>>
>>>>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>>>>> defaults to disabled for now; The long term aim is for this to defaut to
>>>>> enabled, but there are some risks around internal fragmentation that
>>>>> need to be better understood first.
>>>>>
>>>>> When enabled, the folio order is determined as such: For a vma, process
>>>>> or system that has explicitly disabled THP, we continue to allocate
>>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>>> fragmentation so we honour that request.
>>>>>
>>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>>> arch_wants_pte_order() is limited by the new cmdline parameter,
>>>>> `flexthp_unhinted_max`. This allows for a performance boost without
>>>>> requiring any explicit opt-in from the workload while allowing the
>>>>> sysadmin to tune between performance and internal fragmentation.
>>>>>
>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
>>>>> set of ptes map physically contigious, naturally aligned memory, so this
>>>>> mechanism allows the architecture to optimize as required.
>>>>>
>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>> mapped) then we fall back to a suitable lower order; first
>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>>> ---
>>>>>    .../admin-guide/kernel-parameters.txt         |  10 +
>>>>>    mm/Kconfig                                    |  10 +
>>>>>    mm/memory.c                                   | 187 ++++++++++++++++--
>>>>>    3 files changed, 190 insertions(+), 17 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt
>>>>> b/Documentation/admin-guide/kernel-parameters.txt
>>>>> index a1457995fd41..405d624e2191 100644
>>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>>> @@ -1497,6 +1497,16 @@
>>>>>                           See Documentation/admin-guide/sysctl/net.rst for
>>>>>                           fb_tunnels_only_for_init_ns
>>>>>
>>>>> +       flexthp_unhinted_max=
>>>>> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The
>>>>> maximum
>>>>> +                       folio size that will be allocated for an anonymous vma
>>>>> +                       that has neither explicitly opted in nor out of using
>>>>> +                       transparent hugepages. The size must be a
>>>>> power-of-2 in
>>>>> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size
>>>>> improves
>>>>> +                       performance by reducing page faults, while a smaller
>>>>> +                       size reduces internal fragmentation. Default: max(64K,
>>>>> +                       PAGE_SIZE). Format: size[KMG].
>>>>> +
>>>>
>>>> Let's split this parameter into a separate patch.
>>>>
>>>
>>> Just a general comment after stumbling over patch #2, let's not start splitting
>>> patches into things that don't make any sense on their own; that just makes
>>> review a lot harder.
>>
>> ACK
>>
>>>
>>> For this case here, I'd suggest first adding the general infrastructure and then
>>> adding tunables we want to have on top.
>>
>> OK, so 1 patch for the main infrastructure, then a patch to disable for
>> MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
>> via a sysctl?
>
> MADV_NOHUGEPAGE handling for me falls under the category "required for
> correctness to not break existing workloads" and has to be there initially.
>
> Anything that is rather a performance tunable (e.g., a sysctl to optimize) can
> be added on top and discussed separately.>
> At least IMHO :)
>
>>
>>>
>>> I agree that toggling that at runtime (for example via sysfs as raised by me
>>> previously) would be nicer.
>>
>> OK, I clearly misunderstood, I thought you were requesting a boot parameter.
>
> Oh, sorry about that. I wanted to actually express
> "/sys/kernel/mm/transparent_hugepage/" sysctls where we can toggle that later at
> runtime as well.
>
>> What's the ABI compat guarrantee for sysctls? I assumed that for a boot
>> parameter it would be easier to remove in future if we wanted, but for sysctl,
>> its there forever?
>
> sysctl are hard/impossible to remove, yes. So we better make sure what we add
> has clear semantics.
>
> If we ever want some real auto-tunable mode (and can actually implement it
> without harming performance; and I am skeptical), we might want to allow for
> setting such a parameter to "auto", for example.
>
>>
>> Also, how do you feel about the naming and behavior of the parameter?
>
> Very good question. "flexthp_unhinted_max" naming is a bit suboptimal.
>
> For example, I'm not so sure if we should expose the feature to user space as
> "flexthp" at all. I think we should find a clearer feature name to begin with.
>
> ... maybe we can initially get away with dropping that parameter and default to
> something reasonably small (i.e., 64k as you have above)?

That would certainly get my vote. But it was you who was arguing for a tunable
previously ;-). I propose we use the following as the "unhinted ceiling" for
now, then we can add a tunable if/when we find a use case that doesn't work
optimally with this value:

static int flexthp_unhinted_max_order =
ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;

(Using PAGE_SIZE when its gt 64K to cover the ppc case that looks like it can
support 256K pages. Open coding the max because max() can't be used outside a
function).

>
> /sys/kernel/mm/transparent_hugepage/enabled=never and simply not get any thp.

Yes, that should work with the patch as it is today.

>