Re: Performance of low-cpu utilisation benchmark regressed severely since 4.6

From: Mel Gorman
Date: Wed Apr 19 2017 - 04:15:51 EST


On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
> Hi Mel,
>
> Thanks for the "how to" information.
> This is a very interesting use case.
> From trace data, I see a lot of minimal durations with
> virtually no load on the CPU, typically more consistent
> with some type of light-duty periodic (~100 Hz) work flow
> (where we would prefer to not ramp up frequencies, or more
> accurately, not keep them ramped up).

This broadly matches my expectations in terms of behaviour. It is a
low-duty workload, but while I accept that a laptop may not want the
frequencies to ramp up, that is not universally true. Long periods at low
frequency to complete a workload are not necessarily better than using a
high frequency to race to idle. Effectively, a low-utilisation test suite
could be considered a "foreground task of high priority" rather than a
"background task of little interest".

> My results (further below) are different than yours, sometimes
> dramatically, but the trends are similar.

It's inevitable that there would be some hardware-based differences. The
machine I have appears to show an extreme case.

> I have nothing to add about the control algorithm over what
> Rafael already said.
>
> On 2017.04.11 09:42 Mel Gorman wrote:
> > On Tue, Apr 11, 2017 at 08:41:09AM -0700, Doug Smythies wrote:
> >> On 2017.04.11 03:03 Mel Gorman wrote:
> >>>On Mon, Apr 10, 2017 at 10:51:38PM +0200, Rafael J. Wysocki wrote:
> >>>> On Mon, Apr 10, 2017 at 10:41 AM, Mel Gorman wrote:
> >>>>>
> >>>>> It's far more obvious when looking at the git test suite and the length
> >>>>> of time it takes to run. This is a shellscript and git intensive workload
> >>>>> whose CPU utilisation is very low but is less sensitive to multiple
> >>>>> factors than netperf and sockperf.
> >>>>
> >>
> >> I would like to repeat your tests on my test computer (i7-2600K).
> >> I am not familiar with, and have not been able to find,
> >> "the git test suite" shellscript. Could you point me to it?
> >>
> >
> > If you want to use git source directly do a checkout from
> > https://github.com/git/git and build it. The core "benchmark" is make
> > test and timing it.
>
> Because I had trouble with your method further below, I also did
> this method. I did 5 runs, after a throw-away run, similar to what
> you do (and I could see the need for a throw-away pass).
>

Yeah, at the very least IO effects should be eliminated.
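
For anyone else following along, the standalone method boils down to
something like this sketch (not exactly what either of us ran; the
throw-away pass and five timed runs follow your description, and the -j
width is arbitrary):

  git clone https://github.com/git/git
  cd git
  make -j$(nproc)                   # build git itself first
  make test > /dev/null 2>&1        # throw-away pass to warm the caches
  for I in 1 2 3 4 5; do
          /usr/bin/time -o time.$I make test > /dev/null 2>&1
  done

GNU time writes each report to time.$I while the test output itself is
discarded.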

> Results (there is something wrong with user and system times and CPU%
> in kernel 4.5, so I only calculated Elapsed differences):
>

In case it matters, mmtests reports the User and System CPU times as
standard for this class of workload even though they are not always
interesting. Generally, I consider the elapsed time to be the most
important figure, but a major change in system CPU time is often worth
noting. That is not universally reliable, as the way system CPU time is
calculated has changed between kernel versions and is sensitive to
Kconfig options, with VIRT_CPU_ACCOUNTING_GEN being a notable source of
confusion in the past.
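
If there is any doubt, it's worth checking how the kernels being
compared were configured. A quick sketch, assuming the usual
/boot/config-* convention:

  grep VIRT_CPU_ACCOUNTING /boot/config-$(uname -r)
  # or, if CONFIG_IKCONFIG_PROC is enabled on the running kernel:
  zcat /proc/config.gz | grep VIRT_CPU_ACCOUNTING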

> Linux s15 4.5.0-stock #232 SMP Tue Apr 11 23:54:49 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
> ... test_run: start 5 runs ...
> 327.04user 122.08system 33:57.81elapsed (2037.81 : reference) 22%CPU
> ... test_run: done ...
>
> Linux s15 4.11.0-rc6-stock #231 SMP Mon Apr 10 08:29:29 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> intel_pstate - powersave
> ... test_run: start 5 runs ...
> 1518.71user 552.87system 39:24.45elapsed (2364.45 : -16.03%) 87%CPU
> ... test_run: done ...
>
> intel_pstate - performance (fast reference)
> ... test_run: start 5 runs ...
> 1160.52user 291.33system 29:36.05elapsed (1776.05 : 12.85%) 81%CPU
> ... test_run: done ...
>
> intel_cpufreq - powersave (slow reference)
> ... test_run: start 5 runs ...
> 2165.72user 1049.18system 57:12.77elapsed (3432.77 : -68.45%) 93%CPU
> ... test_run: done ...
>
> intel_cpufreq - ondemand
> ... test_run: start 5 runs ...
> 1776.79user 808.65system 47:14.74elapsed (2834.74 : -39.11%) 91%CPU
>

Nothing overly surprising there. It has been my observation that
intel_pstate is generally better than acpi_cpufreq, which somewhat
amuses me given that I still see suggestions to disable intel_pstate
entirely despite that advice being based on much older kernels.
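
As an aside, for anyone comparing results from their own machine, the
driver and governor actually in use can be confirmed from sysfs, e.g.

  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

where scaling_driver reports intel_pstate in active mode and
intel_cpufreq (or acpi-cpufreq) otherwise.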

> intel_cpufreq - schedutil
> ... test_run: start 5 runs ...
> 2049.28user 1028.70system 54:57.82elapsed (3297.82 : -61.83%) 93%CPU
> ... test_run: done ...
>

I'm mildly surprised at this. I had observed that schedutil is not great
but I don't recall seeing a result this bad.

> Linux s15 4.11.0-rc6-revert #233 SMP Wed Apr 12 15:30:19 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
> ... test_run: start 5 runs ...
> 1295.30user 365.98system 32:50.15elapsed (1970.15 : 3.32%) 84%CPU
> ... test_run: done ...
>

And the revert does help, although it is not an option for the reasons
Rafael covered.
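
My reading of your percentages is that they are relative to the 4.5
elapsed-time reference, so the revert comes out

  (2037.81 - 1970.15) / 2037.81 * 100 = 3.32%

faster than 4.5, and the same formula reproduces the -16.03% for the
4.11-rc6 powersave run.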

> > The way I'm doing it is via mmtests so
> >
> > git clone https://github.com/gormanm/mmtests
> > cd mmtests
> > ./run-mmtests --no-monitor --config configs/config-global-dhp__workload_shellscripts test-run-1
> > cd work/log
> > ../../compare-kernels.sh | less
> >
> > and it'll generate a similar report to what I posted in this email
> > thread. If you do multiple tests with different kernels then change the
> > name of "test-run-1" to preserve the old data. compare-kernel.sh will
> > compare whatever results you have.
>
> k4.5 k4.11-rc6 k4.11-rc6 k4.11-rc6 k4.11-rc6 k4.11-rc6 k4.11-rc6
> performance pass-ps pass-od pass-su revert
> E min 388.71 456.51 (-17.44%) 342.81 ( 11.81%) 668.79 (-72.05%) 552.85 (-42.23%) 646.96 (-66.44%) 375.08 ( 3.51%)
> E mean 389.74 458.52 (-17.65%) 343.81 ( 11.78%) 669.42 (-71.76%) 553.45 (-42.01%) 647.95 (-66.25%) 375.98 ( 3.53%)
> E stddev 0.85 1.64 (-92.78%) 0.67 ( 20.83%) 0.41 ( 52.25%) 0.31 ( 64.00%) 0.68 ( 20.35%) 0.46 ( 46.00%)
> E coeffvar 0.22 0.36 (-63.86%) 0.20 ( 10.25%) 0.06 ( 72.20%) 0.06 ( 74.65%) 0.10 ( 52.09%) 0.12 ( 44.03%)
> E max 390.90 461.47 (-18.05%) 344.83 ( 11.79%) 669.91 (-71.38%) 553.68 (-41.64%) 648.75 (-65.96%) 376.37 ( 3.72%)
>
> E = Elapsed (squished in an attempt to prevent line length wrapping when I send)
>
> k4.5 k4.11-rc6 k4.11-rc6 k4.11-rc6 k4.11-rc6 k4.11-rc6 k4.11-rc6
> performance pass-ps pass-od pass-su revert
> User 347.26 1801.56 1398.76 2540.67 2106.30 2434.06 1536.80
> System 139.01 701.87 366.59 1346.75 1026.67 1322.39 449.81
> Elapsed 2346.77 2761.20 2062.12 4017.47 3321.10 3887.19 2268.90
>
> Legend:
> blank = active mode: intel_pstate - powersave
> performance = active mode: intel_pstate - performance (fast reference)
> pass-ps = passive mode: intel_cpufreq - powersave (slow reference)
> pass-od = passive mode: intel_cpufreq - ondemand
> pass-su = passive mode: intel_cpufreq - schedutil
> revert = active mode: intel_pstate - powersave with commit ffb810563c0c reverted.
>
> I deleted the user, system, and CPU rows, because they don't make any sense.
>

User is particularly misleading. System can also be misleading between
kernel versions due to accounting differences, so I'm ok with that.
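
In case anyone else wants to reproduce the passive-mode rows in your
legend, my understanding is that a minimal sketch on a 4.11-era kernel
is to boot with intel_pstate=passive on the kernel command line (which
registers the intel_cpufreq driver) and then select the governor per
CPU:

  for CPU in /sys/devices/system/cpu/cpu[0-9]*; do
          echo ondemand > $CPU/cpufreq/scaling_governor
  done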

> I do not know why the tests run overall so much faster on my computer,

Differences in CPU I imagine. I know the machine I'm reporting on is a
particularly bad example. I've seen other machines where the effect is
less severe.

> I can only assume I have something wrong in my installation of your mmtests.

No, I've seen results broadly similar to yours on other machines so I
don't think you have a methodology error.

> I do see mmtests looking for some packages which it can not find.
>

That's not too unusual. The package names are based on openSUSE naming
and that doesn't always translate to other distributions. If you open
bin/install-depends, you'll see a hashmap near the top that maps some of
the names for Red Hat-based distributions and Debian. It's not actively
maintained. You can either install the packages manually before the test
or update the mappings.
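
Illustrative only and not the actual mmtests code, but the idea of the
mapping is along these lines, with made-up package names:

  # hypothetical mapping from an openSUSE package name to its Debian
  # equivalent; fall back to the original name if there is no mapping
  declare -A PACKAGE_MAP=( ["libfoo-devel"]="libfoo-dev" )
  PKG="libfoo-devel"
  apt-get install -y "${PACKAGE_MAP[$PKG]:-$PKG}"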

> Mel wrote:
> > The results show that it's not the only source as a revert (last column)
> > doesn't fix the damage although it goes from 3750 seconds (4.11-rc5 vanilla)
> > to 2919 seconds (with a revert).
>
> In my case, the reverted code ran faster than the kernel 4.5 code.
>
> The other big difference is between Kernel 4.5 and 4.11-rc5 you got
> -102.28% elapsed time, whereas I got -16.03% with method 1 and
> -17.65% with method 2 (well, between 4.5 and 4.11-rc6 in my case).
> I only get -93.28% and -94.82% difference between my fast and slow reference
> tests (albeit on the same kernel).
>

I have no reason to believe this is a methodology error; it is more
likely due to a difference in CPU. Consider the following reports:

http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/delboy/#gitsource
http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/ivy/#gitsource

The first one (delboy) shows a gain of 1.35%, and it's only 4.11 (the
kernel shown is 4.11-rc1 with vmscan-related patches on top that do not
affect this test case) that shows a regression, of -17.51%, which is
very similar to yours. The CPU there is a Xeon E3-1230 v5.

The second report (ivy) is from the machine I based the original
complaint on, and it shows the large regression in elapsed time.

So, different CPUs have different behaviours, which is no surprise at
all considering that, at the very least, exit latencies will differ.
While there may not be a universally correct answer for how to do this
automatically, is it possible to tune intel_pstate such that it ramps up
quickly regardless of recent utilisation and reduces relatively slowly?
That would be better from a power consumption perspective than setting
the "performance" governor.

Thanks.

--
Mel Gorman
SUSE Labs