Re: [PATCH v4 2/5] mm: LARGE_ANON_FOLIO for improved performance

From: Ryan Roberts
Date: Mon Aug 07 2023 - 15:00:38 EST

On 04/08/2023 19:53, Yu Zhao wrote:
> On Fri, Aug 4, 2023 at 3:06 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>> On 04/08/2023 01:19, Yu Zhao wrote:
>>> On Thu, Aug 3, 2023 at 8:27 AM Kirill A. Shutemov
>>> <kirill.shutemov@xxxxxxxxxxxxxxx> wrote:
>>>> On Thu, Aug 03, 2023 at 01:43:31PM +0100, Ryan Roberts wrote:
>>>>> + Kirill
>>>>> On 26/07/2023 10:51, Ryan Roberts wrote:
>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>> counting, rmap management lru list management) are also significantly
>>>>>> reduced since those ops now become per-folio.
>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>> which defaults to disabled for now; The long term aim is for this to
>>>>>> defaut to enabled, but there are some risks around internal
>>>>>> fragmentation that need to be better understood first.
>>>>>> When enabled, the folio order is determined as such: For a vma, process
>>>>>> or system that has explicitly disabled THP, we continue to allocate
>>>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>>>> fragmentation so we honour that request.
>>>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>>>> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
>>>>>> bigger). This allows for a performance boost without requiring any
>>>>>> explicit opt-in from the workload while limitting internal
>>>>>> fragmentation.
>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>> ...
>>>>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>>> +
>>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>>> +{
>>>>>> + int order;
>>>>>> +
>>>>>> + /*
>>>>>> + * If THP is explicitly disabled for either the vma, the process or the
>>>>>> + * system, then this is very likely intended to limit internal
>>>>>> + * fragmentation; in this case, don't attempt to allocate a large
>>>>>> + * anonymous folio.
>>>>>> + *
>>>>>> + * Else, if the vma is eligible for thp, allocate a large folio of the
>>>>>> + * size preferred by the arch. Or if the arch requested a very small
>>>>>> + * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>>>> + * which still meets the arch's requirements but means we still take
>>>>>> + * advantage of SW optimizations (e.g. fewer page faults).
>>>>>> + *
>>>>>> + * Finally if thp is enabled but the vma isn't eligible, take the
>>>>>> + * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>>> + * This ensures workloads that have not explicitly opted-in take benefit
>>>>>> + * while capping the potential for internal fragmentation.
>>>>>> + */
>>>>>> +
>>>>>> + if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>>>> + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>>>> + !hugepage_flags_enabled())
>>>>>> + order = 0;
>>>>>> + else {
>>>>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>> +
>>>>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>> + }
>>>>>> +
>>>>>> + return order;
>>>>>> +}
>>>>> Hi All,
>>>>> I'm writing up the conclusions that we arrived at during discussion in the THP
>>>>> meeting yesterday, regarding linkage with exiting THP ABIs. It would be great if
>>>>> I can get explicit "agree" or disagree + rationale from at least David, Yu and
>>>>> Kirill.
>>>>> In summary; I think we are converging on the approach that is already coded, but
>>>>> I'd like confirmation.
>>>>> The THP situation today
>>>>> -----------------------
>>>>> - At system level: THP can be set to "never", "madvise" or "always"
>>>>> - At process level: THP can be "never" or "defer to system setting"
>>>>> - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
>>>>> That gives us this table to describe how a page fault is handled, according to
>>>>> process state (columns) and vma flags (rows):
>>>>> | never | madvise | always
>>>>> ----------------|-----------|-----------|-----------
>>>>> no hint | S | S | THP>S
>>>>> Legend:
>>>>> S allocate single page (PTE-mapped)
>>>>> LAF allocate lage anon folio (PTE-mapped)
>>>>> THP allocate THP-sized folio (PMD-mapped)
>>>>>> fallback (usually because vma size/alignment insufficient for folio)
>>>>> Principles for Large Anon Folios (LAF)
>>>>> --------------------------------------
>>>>> David tells us there are use cases today (e.g. qemu live migration) which use
>>>>> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
>>>>> and these use cases will break (i.e. functionally incorrect) if this request is
>>>>> not honoured.
>>>>> So LAF must at least honour MADV_NOHUGEPAGE to prevent breaking existing use
>>>>> cases. And once we do this, then I think the least confusing thing is for it to
>>>>> also honor the "never" system/process state; so if either the system, process or
>>>>> vma has explicitly opted-out of THP, then LAF should also be bypassed.
>>>>> Similarly, any case that would previously cause the allocation of PMD-sized THP
>>>>> must continue to be honoured, else we risk performance regression.
>>>>> That leaves the "madvise/no-hint" case, and all THP fallback paths due to the
>>>>> VMA not being correctly aligned or sized to hold a PMD-sized mapping. In these
>>>>> cases, we will attempt to use LAF first, and fallback to single page if the vma
>>>>> size/alignment doesn't permit it.
>>>>> | never | madvise | always
>>>>> ----------------|-----------|-----------|-----------
>>>>> no hint | S | LAF>S | THP>LAF>S
>>>>> I think this (perhaps conservative) approach will be the least surprising to
>>>>> users. And is the policy that is already implemented in this patch.
>>>> This looks very reasonable.
>>>> The only questionable field is no-hint/madvise. I can argue for both LAF>S
>>>> and S here. I think LAF>S is fine as long as we are not too aggressive
>>>> with allocation order.
>>>> I think we need to work on eliminating reasons for users to set 'never'.
>>>> If something behaves better with 'never' kernel has failed user.
>>>>> Downsides of this policy
>>>>> ------------------------
>>>>> As Yu and Yin have pointed out, there are some workloads which do not perform
>>>>> well with THP, due to large fault latency or memory wastage, etc. But which
>>>>> _may_ still benefit from LAF. By taking the conservative approach, we exclude
>>>>> these workloads from benefiting automatically.
>>>> Hm. I don't buy it. Why THP with order-9 is too much, but order-8 LAF is
>>>> fine?
>>> No, it's not. And no one said order-8 LAF is fine :) The starting
>>> order for LAF that we have been discussing is at most 64KB (vs 2MB
>>> THP). For my taste, it's still too large. I'd go with 32KB/16KB.
>> Its currently influenced by the arch. If the arch doesn't have an opinion then
>> its currently 32K in the code. The 64K size is my aspiration for arm64 if/when I
>> land the contpte mapping work.
> Just to double check: this discussion covers the long term/permanente
> solution/roadmap, correct? That's what Kirill and I were arguing
> about. Otherwise, the order-8/9 concern above is totally irrelevant,
> since we don't have them in this series.
> For the short term (this series), what you described above looks good
> to me: we may regress but will not break any existing use cases, and
> we are behind a Kconfig option.

OK that's good to hear.

>>> However, the same argument can be used to argue against the policy
>>> Ryan listed above: why order-10 LAF is ok for madvise but not order-11
>>> (which becomes "always")?
>> Sorry I don't understand what you are saying here. Where has order-10 LAF come from?
> I pushed that rhetoric a bit further: order-11 is the THP size (32MB)
> with 16KB base page size on ARM. Confusing, isn't it? And there is
> another complaint from Fengwei here [1].
> [1]
>>> I'm strongly against this policy
> Again, just to be clear: I'm strongly against this policy to be
> exposed to userspace in any way and become a long-term/permanent thing
> we have to maintain/change in the future, since I'm assuming that's
> the context.

I'm still confused. The policy I described (and which I thought we were
discussing) does not expose any new tunables to user space. And you said above
that what I described "looks good to me". So is "this policy" which you are
strongly against referring to the policy I wrote down or something else?

>> Ugh, I thought we came to an agreement (or at least "disagree and commit") on
>> the THP call. Obviously I was wrong.
> My impression is we only agreed on one thing: at the current stage, we
> should respect things we absolutely have to. We didn't agree on what
> "never" means ("never 2MB" or "never >4KB"), and we didn't touch on
> how "always" should behave at all.

I _think_ we have now agreed some of this in other threads, but please re-raise
in the context of the other email thread I just sent out - its probably cleaner
to continue discussion there.

>> David is telling us that we will break user space if we don't consider
>> MADV_NOHUGEPAGE to mean "never allocate memory to unfaulted addresses". So tying
>> to at least this must be cast in stone, no? Could you lay out any policy
>> proposal you have as an alternative that still follows this requirement?
> If MADV_NOHUGEPAGE falls into the category of things we have to
> absolutely respect, then we will. But I don't think it does, because
> the UFFD check we have in this series already guarantees the KVM use
> case. I can explain how it works in detail if it's still not clear to
> you: long story short, the UFFD check precedes the MADV_NOHUGEPAGE
> check in alloc_anon_folio().

I think we have now concluded that its not this simple; MADV_NOHUGEPAGE is
applied to the region and pages are faulted in before UFFD is registered, so
checking for UFFD is not sufficient.

> Here is what I recommend for the medium and long terms:
> For the short term, hard-coding two orders (hw/sw preferred), putting
> them behind a Kconfig and not exposing this info to the userspace are
> good enough for me.

I think that's pretty much what I have now, so perhaps I'm laboring the point a
bit too much here. Let's just get the prerequisites ticked off then get this
patch set merged??