Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory

From: Yu Zhao
Date: Tue Jul 04 2023 - 20:22:09 EST


On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
>
>
> On 7/4/23 23:36, Ryan Roberts wrote:
> > On 04/07/2023 08:11, Yu Zhao wrote:
> >> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>>
> >>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
> >>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> This is v2 of a series to implement variable order, large folios for anonymous
> >>>>> memory. The objective of this is to improve performance by allocating larger
> >>>>> chunks of memory during anonymous page faults. See [1] for background.
> >>>>
> >>>> Thanks for the quick response!
> >>>>
> >>>>> I've significantly reworked and simplified the patch set based on comments from
> >>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> >>>>> VARIABLE_THP, on Yu's advice.
> >>>>>
> >>>>> The last patch is for arm64 to explicitly override the default
> >>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted
> >>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change
> >>>>> could be handled through the arm64 tree separately. Neither has any build
> >>>>> dependency on the other.
> >>>>>
> >>>>> The one area where I haven't followed Yu's advice is in the determination of the
> >>>>> size of folio to use. It was suggested that I have a single preferred large
> >>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> >>>>> being existing overlapping populated PTEs, etc) then fallback immediately to
> >>>>> order-0. It turned out that this approach caused a performance regression in the
> >>>>> Speedometer benchmark.
> >>>>
> >>>> I suppose it's regression against the v1, not the unpatched kernel.
> >>> From the performance data Ryan shared, it's against unpatched kernel:
> >>>
> >>> Speedometer 2.0:
> >>>
> >>> | kernel | runs_per_min |
> >>> |:-------------------------------|---------------:|
> >>> | baseline-4k | 0.0% |
> >>> | anonfolio-lkml-v1 | 0.7% |
> >>> | anonfolio-lkml-v2-simple-order | -0.9% |
> >>> | anonfolio-lkml-v2 | 0.5% |
> >>
> >> I see. Thanks.
> >>
> >> A couple of questions:
> >> 1. Do we have a stddev?
> >
> > | kernel | mean_abs | std_abs | mean_rel | std_rel |
> > |:------------------------- |-----------:|----------:|-----------:|----------:|
> > | baseline-4k | 117.4 | 0.8 | 0.0% | 0.7% |
> > | anonfolio-v1 | 118.2 | 1 | 0.7% | 0.9% |
> > | anonfolio-v2-simple-order | 116.4 | 1.1 | -0.9% | 0.9% |
> > | anonfolio-v2 | 118 | 1.2 | 0.5% | 1.0% |
> >
> > This is with 3 runs per reboot across 5 reboots, with first run after reboot
> > trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data
> > points per kernel in total.
> >
> > I've rerun the test multiple times and see similar results each time.
> >
> > I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I
> > see the same performance as baseline-4k.
> >
> >
> >> 2. Do we have a theory why it regressed?
> >
> > I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
> > mean when we fault, order-4 is often too big to fit in the VMA. So we fallback
> > to order-0. I guess this is happening so often for this workload that the cost
> > of doing the checks and fallback is outweighing the benefit of the memory that
> > does end up with order-4 folios.
> >
> > I've sampled the memory in each bucket (once per second) while running and its
> > roughly:
> >
> > 64K: 25%
> > 32K: 15%
> > 16K: 15%
> > 4K: 45%
> >
> > 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
> > But potentially, I suspect there is lots of mmap/unmap for the smaller sizes and
> > the 64K contents is more static - that's just a guess though.
> So this is like out of vma range thing.
>
> >
> >> Assuming no bugs, I don't see how a real regression could happen --
> >> falling back to order-0 isn't different from the original behavior.
> >> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
> >
> > I can, but it will have to be a bit later in the week. I'll do some more test
> > runs overnight so we have a larger number of runs - hopefully that might tell us
> > that this is noise to a certain extent.
> >
> > I'd still like to hear a clear technical argument for why the bin-packing
> > approach is not the correct one!
> My understanding to Yu's (Yu, correct me if I am wrong) comments is that we
> postpone this part of change and make basic anon large folio support in. Then
> discuss which approach we should take. Maybe people will agree retry is the
> choice, maybe other approach will be taken...
>
> For example, for this out of VMA range case, per VMA order should be considered.
> We don't need make decision that the retry should be taken now.

I've articulated the reasons in another email. Just summarize the most
important point here:
using more fallback orders makes a system reach equilibrium faster, at
which point it can't allocate the order of arch_wants_pte_order()
anymore. IOW, this best-fit policy can reduce the number of folios of
the h/w prefered order for a system running long enough.