Re: alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9

From: David Hildenbrand
Date: Mon May 17 2021 - 03:46:19 EST


On 16.05.21 18:13, Florian Fainelli wrote:


On 4/22/2021 12:31 PM, Florian Fainelli wrote:
For

https://lkml.kernel.org/r/20210121175502.274391-3-minchan@xxxxxxxxxx

to do its work you'll have to pass __GFP_NORETRY to
alloc_contig_range(). This requires adaptations in CMA, from where we
call alloc_contig_range().

Yes, I did modify the alloc_contig_range() caller to pass GFP_KERNEL |
__GFP_NORETRY. I ran for more iterations (1000) and the results are not
very conclusive: with __GFP_NORETRY the time per allocation was not
significantly better, and in fact was slightly worse, by about 100us,
than without it.
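
For reference, a minimal sketch of such a modified caller with
per-allocation timing, assuming a direct alloc_contig_range() call (the
function name and PFN range are placeholders, not our actual test
harness):

/*
 * Sketch only: time one contiguous allocation attempt with
 * __GFP_NORETRY, so that migration/compaction bails out early
 * instead of retrying.
 */
#include <linux/gfp.h>
#include <linux/ktime.h>
#include <linux/printk.h>

static int timed_contig_alloc(unsigned long start_pfn, unsigned long end_pfn)
{
	ktime_t t0 = ktime_get();
	int ret = alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
				     GFP_KERNEL | __GFP_NORETRY);

	pr_info("alloc_contig_range: ret=%d, took %lld us\n",
		ret, ktime_us_delta(ktime_get(), t0));
	if (!ret)
		free_contig_range(start_pfn, end_pfn - start_pfn);
	return ret;
}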

My x86 VM with 1GB of DRAM, 512MB of which is in ZONE_MOVABLE, does show
identical numbers on both 4.9 and 5.4, so this must be something
specific to ARM64 and/or the code we added to create a ZONE_MOVABLE on
that architecture, since the movablecore= kernel parameter does not
appear to have any effect there, unlike on x86.

We tracked down the slowdown to two major contributors:

- for a reason that we do not fully understand yet, the same cpufreq
governor (conservative) did not slow alloc_contig_range() down on 4.9
as much as it does on 5.4. Running the tests with the performance
cpufreq governor works a tad better and the results are more consistent
from run to run, with a smaller variation.

Interesting! So your CPU is down-clocking while performing (heavy) kernel work? Is that expected or are we mis-accounting kernel cpu time somehow when it comes to determining the CPU target frequency?


- another large contributor to the slowdown was having
CONFIG_IRQSOFF_TRACER enabled. Since c3bc8fd637a9623f5c507bd18f9677effbddf584
("tracing: Centralize preemptirq tracepoints and unify their usage") we
now prepare arguments for the tracepoints even when we end up not using
them because tracing is not enabled at runtime. Getting the calling
function's return address is cheap on arm64 for level == 0, but getting
any earlier caller involves walking the stack, which is expensive (see
arch/arm64/kernel/return_address.c).
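
For illustration, a simplified sketch of what that file does,
paraphrased from the 5.4-era source (the unwind setup and helper
signatures differ between kernel versions): each lookup above level 0
triggers a full frame-pointer walk rather than reading a single
register.

#include <linux/sched.h>
#include <asm/stacktrace.h>

struct return_address_data {
	unsigned int level;
	void *addr;
};

static int save_return_addr(struct stackframe *frame, void *d)
{
	struct return_address_data *data = d;

	if (!data->level) {
		data->addr = (void *)frame->pc;
		return 1;	/* found the requested level, stop */
	}
	--data->level;
	return 0;		/* keep unwinding */
}

void *return_address(unsigned int level)
{
	struct return_address_data data = {
		/* +2 skips return_address() and walk_stackframe() */
		.level = level + 2,
		.addr = NULL,
	};
	struct stackframe frame;

	frame.fp = (unsigned long)__builtin_frame_address(0);
	frame.pc = (unsigned long)return_address;

	/* the expensive part: unwind frame by frame on every call */
	walk_stackframe(current, &frame, save_return_addr, &data);

	return data.level ? NULL : data.addr;
}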

Again, very interesting finding.


So with these two variables eliminated we are only about 2x slower on
5.4 than we were on 4.9, which is acceptable for our use case. I would
not say the case is closed, but at least we understand it better. We now
have 5.10 brought up to speed, so any new investigation will be focused
on that kernel.


Thanks for the insight, please do let me know when you learn more. A 2x
slowdown is still quite a lot.

--
Thanks,

David / dhildenb