Re: alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9

From: Florian Fainelli
Date: Sun May 16 2021 - 12:13:54 EST


On 4/22/2021 12:31 PM, Florian Fainelli wrote:
>> For
>>
>> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@xxxxxxxxxx
>>
>> to do its work you'll have to pass __GFP_NORETRY to
>> alloc_contig_range(). This requires CMA adaptions, from where we call
>> alloc_contig_range().
>
> Yes, I did modify the alloc_contig_range() caller to pass GFP_KERNEL |
> __GFP_NORETRY. I did run for more iterations (1000) and the results
> are not very conclusive: with __GFP_NORETRY the time per allocation
> was not significantly better, in fact it was slightly worse, by 100us,
> than without.
>
> My x86 VM with 1GB of DRAM, 512MB of which is in ZONE_MOVABLE, shows
> identical numbers for both 4.9 and 5.4, so this must be something
> specific to ARM64 and/or the code we added to create a ZONE_MOVABLE on
> that architecture, since movablecore does not appear to have any effect
> there, unlike on x86.

We tracked down the slowdowns to be caused by two major contributors:

- for a reason that we do not fully understand yet, the same cpufreq
governor (conservative) did not slow alloc_contig_range() down on 4.9
as much as it does on 5.4. Running the tests with the performance
cpufreq governor works a tad better, and the results are more
consistent from run to run, with a smaller variation.

- another large contributor to the slowdown was having enabled
CONFIG_IRQSOFF_TRACER. After c3bc8fd637a9623f5c507bd18f9677effbddf584
("tracing: Centralize preemptirq tracepoints and unify their usage") we
now prepare arguments for tracing even if we end up not using them since
tracing is not enabled at runtime. Getting the caller function's return
address is cheap on arm64 for level == 0, but getting the preceding
caller involves doing a backtrace walk which is expensive (see
arch/arm64/kernel/return_address.c).
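
To illustrate the second point, here is a minimal userspace sketch of
the pattern, not the kernel code itself. All names in it (trace_enabled,
lookup_caller(), hook_eager(), hook_guarded(), count_walks()) are
hypothetical stand-ins that I made up for the illustration; the "walk"
is just a counter standing in for the backtrace that a level > 0 lookup
needs on arm64. It only shows why preparing the arguments before the
runtime check costs something even when tracing is off.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only, not kernel APIs. */
static bool trace_enabled;        /* models "tracing enabled at runtime" */
static unsigned long walk_count;  /* number of expensive lookups performed */
static unsigned long events_logged;

/* Models a caller lookup: level 0 is cheap, level > 0 needs a "walk". */
static unsigned long lookup_caller(int level)
{
	assert(level >= 0);
	if (level > 0)
		walk_count++;
	return 0x1000ul + (unsigned long)level;
}

/* The consumer drops the event when tracing is disabled. */
static void trace_event(unsigned long ip, unsigned long parent_ip)
{
	(void)ip;
	(void)parent_ip;
	if (!trace_enabled)
		return;
	events_logged++;
}

/* Eager: both arguments are computed before the enabled check. */
static void hook_eager(void)
{
	unsigned long ip = lookup_caller(0);
	unsigned long parent_ip = lookup_caller(1);

	trace_event(ip, parent_ip);
}

/* Guarded: the expensive argument is only computed if tracing is on. */
static void hook_guarded(void)
{
	if (!trace_enabled)
		return;
	trace_event(lookup_caller(0), lookup_caller(1));
}

/* Runs a hook n times and returns how many walks it triggered. */
static unsigned long count_walks(void (*hook)(void), int n, bool enabled)
{
	trace_enabled = enabled;
	walk_count = 0;
	for (int i = 0; i < n; i++)
		hook();
	return walk_count;
}
```

In this model, count_walks(hook_eager, 1000, false) performs 1000 walks
with tracing disabled, while count_walks(hook_guarded, 1000, false)
performs none. Whether a similar guard is the right change for the
kernel hooks themselves is a separate question that this sketch does
not answer.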

So with these two variables eliminated, we are only about 2x slower on
5.4 than we were on 4.9 and this is acceptable for our use case. I would
not say the case is closed but at least we understand it better. We now
have 5.10 brought up to speed so any new investigation will be focused
on that kernel.

Thanks a lot for your help David!
--
Florian
