Re: [RFC/RFT] [PATCH v3 0/4] Intel_pstate: HWP Dynamic performance boost

From: Giovanni Gherdovich
Date: Mon Jun 04 2018 - 13:58:48 EST


On Thu, May 31, 2018 at 03:51:39PM -0700, Srinivas Pandruvada wrote:
> v3
> - Removed atomic bit operation as suggested.
> - Added description of contention with user space.
> - Removed hwp cache, boost utililty function patch and merged with util callback
> patch. This way any value set is used somewhere.
>
> Waiting for test results from Mel Gorman, who is the original reporter.
> [SNIP]

Tested-by: Giovanni Gherdovich <ggherdovich@xxxxxxx>

This series has an overall positive performance impact on IO both on xfs and
ext4, and I'd be very happy if it lands in v4.18. You dropped the migration
optimization from v1 to v2 after the reviewers' suggestion; I'm looking
forward to testing that part too, so please add me to CC when you resend it.

I've tested your series on a single socket Xeon E3-1240 v5 (Skylake, 4 cores /
8 threads) with SSD storage. The platform is a Dell PowerEdge R230.

The benchmarks used are a mix of I/O intensive workloads on ext4 and xfs
(dbench4, sqlite, pgbench in read/write and read-only configuration, Flexible
IO aka FIO, etc) and scheduler stressers just to check that everything is okay
in that department too (hackbench, pipetest, schbench, sockperf on localhost
both in "throughput" and "under-load" mode, netperf on localhost, etc). I also
included some HPC with the NAS Parallel Benchmark: when it uses openMPI as its
IPC mechanism it ends up being write-intensive, which makes it a good
experiment even if the HPC people aren't exactly the target audience for a
frequency governor.

The large improvements are in the areas you already highlighted in your cover
letter (dbench4, sqlite, and pgbench read/write too, very impressive
honestly). Moderate wins are also observed in sockperf and in running the git
unit tests (gitsource below). The scheduler stressers end up, as expected, in
the "neutral" category, where you'll also find FIO (which, given the other
results, I'd have expected to improve at least a little). Also marked
"neutral" are those results where statistical significance wasn't reached (2
standard deviations, which is roughly like a 0.05 p-value) even if they showed
some difference in one direction or the other. In the "moderate losses"
section I found hackbench run with processes (not threads) and pipes (not
sockets), which I report for due diligence, but looking at the raw numbers
it's more of a mixed bag than a real loss; and the NAS high-perf computing
benchmark when it uses openMP (as opposed to openMPI) for IPC -- but again, we
often find that supercomputer people run their machines at full speed all the
time.
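
As a rough check of the "2 standard deviations is roughly a 0.05 p-value"
rule of thumb above, here is a minimal sketch. It assumes a normally
distributed test statistic and a two-sided test; the function name is mine
and is not something taken from MMTests.

```python
import math

def two_sided_p_value(z):
    """Two-sided tail probability of a standard normal beyond |z|.

    Assumption: the test statistic is normally distributed.
    """
    # P(|Z| > z) = erfc(z / sqrt(2)) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

if __name__ == "__main__":
    # 2 standard deviations gives about 0.0455, close to the usual 0.05
    print(f"p-value at 2 sigma: {two_sided_p_value(2.0):.4f}")
```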

At the bottom of this message you'll find some directions if you want to run
some tests yourself using the same framework I used, MMTests from
https://github.com/gormanm/mmtests (we store a fair amount of benchmark
parametrization up there).

Large wins:

- dbench4:              +20% on ext4,
                        +14% on xfs (always asynch IO)
- sqlite (insert):      +9% on both ext4 and xfs
- pgbench (read/write): +9% on ext4,
                        +10% on xfs

Moderate wins:

- sockperf (type: under-load, localhost): +1% with TCP,
                                          +5% with UDP
- gitsource (git unit tests, shell intensive): +3% on ext4
- NAS Parallel Benchmark (HPC, using openMPI, on xfs): +1%
- tbench4 (network part of dbench4, localhost): +1%

Neutral:

- pgbench (read-only) on ext4 and xfs
- siege
- netperf (streaming and round-robin) with TCP and UDP
- hackbench (sockets/process, sockets/thread and pipes/thread)
- pipetest
- Linux kernel build
- schbench
- sockperf (type: throughput) with TCP and UDP
- git unit tests on xfs
- FIO (both random and seq. read, both random and seq. write)
  on ext4 and xfs, async IO

Moderate losses:

- hackbench (pipes/process): -10%
- NAS Parallel Benchmark with openMP: -1%


Each benchmark is run with a variety of configuration parameters (eg: number
of threads, number of clients, etc); to reach a final "score" the geometric
mean is used (with a few exceptions depending on the type of benchmark).
Detailed results follow. Amean, Hmean and Gmean are respectively arithmetic,
harmonic and geometric means.
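
For reference, the three means can be computed as in the sketch below. This
uses Python's standard statistics module rather than the MMTests code itself,
and the sample values are made up purely for illustration.

```python
import statistics

def summarize(samples):
    """Return the arithmetic, harmonic and geometric means of positive samples."""
    return {
        "Amean": statistics.mean(samples),
        "Hmean": statistics.harmonic_mean(samples),
        "Gmean": statistics.geometric_mean(samples),  # needs Python 3.8+
    }

if __name__ == "__main__":
    # Made-up values, not taken from any benchmark run
    for name, value in summarize([1.0, 2.0, 4.0]).items():
        print(f"{name}: {value:.4f}")
```

For positive samples the ordering is always Hmean <= Gmean <= Amean. The
harmonic mean is the usual choice for rates such as transactions per second.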

For brevity I won't report all tables but only those for "large wins" and
"moderate losses". Note that I'm not overly worried about the hackbench-pipes
situation, as we've studied it in the past and determined that this
configuration is particularly weak: time is mostly spent on contention and the
scheduler code path isn't exercised. See the comment in the file
configs/config-global-dhp__scheduler-unbound in MMTests for a brief
description of the issue.

DBENCH4
=======

NOTES: asynchronous IO; varies the number of clients up to NUMCPUS*8.
MMTESTS CONFIG: global-dhp__io-dbench4-async-{ext4, xfs}
MEASURES: latency (millisecs)
LOWER is better

EXT4
              4.16.0               4.16.0
             vanilla            hwp-boost
Amean   1      28.49 (  0.00%)      19.68 ( 30.92%)
Amean   2      26.70 (  0.00%)      25.59 (  4.14%)
Amean   4      54.59 (  0.00%)      43.56 ( 20.20%)
Amean   8      91.19 (  0.00%)      77.56 ( 14.96%)
Amean   64    538.09 (  0.00%)     438.67 ( 18.48%)
Stddev  1       6.70 (  0.00%)       3.24 ( 51.66%)
Stddev  2       4.35 (  0.00%)       3.57 ( 17.85%)
Stddev  4       7.99 (  0.00%)       7.24 (  9.29%)
Stddev  8      17.51 (  0.00%)      15.80 (  9.78%)
Stddev  64     49.54 (  0.00%)      46.98 (  5.17%)

XFS
              4.16.0               4.16.0
             vanilla            hwp-boost
Amean   1      21.88 (  0.00%)      16.03 ( 26.75%)
Amean   2      19.72 (  0.00%)      19.82 ( -0.50%)
Amean   4      37.55 (  0.00%)      29.52 ( 21.38%)
Amean   8      56.73 (  0.00%)      51.83 (  8.63%)
Amean   64    808.80 (  0.00%)     698.12 ( 13.68%)
Stddev  1       6.29 (  0.00%)       2.33 ( 62.99%)
Stddev  2       3.12 (  0.00%)       2.26 ( 27.73%)
Stddev  4       7.56 (  0.00%)       5.88 ( 22.28%)
Stddev  8      14.15 (  0.00%)      12.49 ( 11.71%)
Stddev  64    380.54 (  0.00%)     367.88 (  3.33%)
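
In case the percentages in parentheses are not obvious: judging from the
numbers, they are the change relative to vanilla, signed so that a positive
value is an improvement. The sketch below is my reading of the tables and not
the MMTests implementation; the function name is made up.

```python
def gain_pct(vanilla, patched, higher_is_better):
    """Relative change versus vanilla, signed so that positive means 'better'.

    Assumption: this mirrors how the table percentages appear to be derived.
    """
    if higher_is_better:
        return (patched - vanilla) / vanilla * 100.0
    return (vanilla - patched) / vanilla * 100.0

if __name__ == "__main__":
    # dbench4 on ext4, 1 client: latency in ms, lower is better
    print(f"{gain_pct(28.49, 19.68, higher_is_better=False):.2f}%")  # 30.92%
```

The Stddev rows seem to follow the lower-is-better convention regardless of
the metric, so a positive percentage there means less variability.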

SQLITE
======

NOTES: SQL insert test on a table that will be 2M in size.
MMTESTS CONFIG: global-dhp__db-sqlite-insert-medium-{ext4, xfs}
MEASURES: transactions per second
HIGHER is better

EXT4
                 4.16.0               4.16.0
                vanilla            hwp-boost
Hmean   Trans   2098.79 (  0.00%)    2292.16 (  9.21%)
Stddev  Trans     78.79 (  0.00%)      95.73 (-21.50%)

XFS
                 4.16.0               4.16.0
                vanilla            hwp-boost
Hmean   Trans   1890.27 (  0.00%)    2058.62 (  8.91%)
Stddev  Trans     52.54 (  0.00%)      29.56 ( 43.73%)

PGBENCH-RW
==========

NOTES: packaged with Postgres. Varies the number of threads up to NUMCPUS. The
workload is scaled so that the approximate size is 80% of the database
shared buffer, which itself is 20% of RAM. The page cache is not flushed
after the database is populated for the test, so the test starts cache-hot.
MMTESTS CONFIG: global-dhp__db-pgbench-timed-rw-small-{ext4, xfs}
MEASURES: transactions per second
HIGHER is better

EXT4
              4.16.0               4.16.0
             vanilla            hwp-boost
Hmean   1    2692.19 (  0.00%)    2660.98 ( -1.16%)
Hmean   4    5218.93 (  0.00%)    5610.10 (  7.50%)
Hmean   7    7332.68 (  0.00%)    8378.24 ( 14.26%)
Hmean   8    7462.03 (  0.00%)    8713.36 ( 16.77%)
Stddev  1     231.85 (  0.00%)     257.49 (-11.06%)
Stddev  4     681.11 (  0.00%)     312.64 ( 54.10%)
Stddev  7    1072.07 (  0.00%)     730.29 ( 31.88%)
Stddev  8    1472.77 (  0.00%)    1057.34 ( 28.21%)

XFS
              4.16.0               4.16.0
             vanilla            hwp-boost
Hmean   1    2675.02 (  0.00%)    2661.69 ( -0.50%)
Hmean   4    5049.45 (  0.00%)    5601.45 ( 10.93%)
Hmean   7    7302.18 (  0.00%)    8348.16 ( 14.32%)
Hmean   8    7596.83 (  0.00%)    8693.29 ( 14.43%)
Stddev  1     225.41 (  0.00%)     246.74 ( -9.46%)
Stddev  4     761.33 (  0.00%)     334.77 ( 56.03%)
Stddev  7    1093.93 (  0.00%)     811.30 ( 25.84%)
Stddev  8    1465.06 (  0.00%)    1118.81 ( 23.63%)
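
To make the sizing in the NOTES above concrete: 80% of a shared buffer that is
20% of RAM works out to roughly 16% of RAM. A minimal sketch of the
arithmetic follows; the RAM size is a hypothetical value for illustration and
is not the configuration of the machine tested here.

```python
SHARED_BUFFER_FRACTION_OF_RAM = 0.20  # from the NOTES above
WORKLOAD_FRACTION_OF_BUFFER = 0.80    # from the NOTES above

def workload_size(ram_bytes):
    """Approximate pgbench workload size for a machine with ram_bytes of RAM."""
    shared_buffer = ram_bytes * SHARED_BUFFER_FRACTION_OF_RAM
    return shared_buffer * WORKLOAD_FRACTION_OF_BUFFER

if __name__ == "__main__":
    ram = 32 * 2**30  # hypothetical 32 GiB machine, an assumption
    print(f"workload size: about {workload_size(ram) / 2**30:.2f} GiB")
```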

HACKBENCH
=========

NOTES: Varies the number of groups between 1 and NUMCPUS*4
MMTESTS CONFIG: global-dhp__scheduler-unbound
MEASURES: time (seconds)
LOWER is better

              4.16.0                 4.16.0
             vanilla              hwp-boost
Amean   1     0.8350 (    0.00%)     1.1577 (  -38.64%)
Amean   3     2.8367 (    0.00%)     3.7457 (  -32.04%)
Amean   5     6.7503 (    0.00%)     5.7977 (   14.11%)
Amean   7     7.8290 (    0.00%)     8.0343 (   -2.62%)
Amean   12   11.0560 (    0.00%)    11.9673 (   -8.24%)
Amean   18   15.2603 (    0.00%)    15.5247 (   -1.73%)
Amean   24   17.0283 (    0.00%)    17.9047 (   -5.15%)
Amean   30   19.9193 (    0.00%)    23.4670 (  -17.81%)
Amean   32   21.4637 (    0.00%)    23.4097 (   -9.07%)
Stddev  1     0.0636 (    0.00%)     0.0255 (   59.93%)
Stddev  3     0.1188 (    0.00%)     0.0235 (   80.22%)
Stddev  5     0.0755 (    0.00%)     0.1398 (  -85.13%)
Stddev  7     0.2778 (    0.00%)     0.1634 (   41.17%)
Stddev  12    0.5785 (    0.00%)     0.1030 (   82.19%)
Stddev  18    1.2099 (    0.00%)     0.7986 (   33.99%)
Stddev  24    0.2057 (    0.00%)     0.7030 ( -241.72%)
Stddev  30    1.1303 (    0.00%)     0.7654 (   32.28%)
Stddev  32    0.2032 (    0.00%)     3.1626 (-1456.69%)

NAS PARALLEL BENCHMARK, C-CLASS (w/ openMP)
===========================================

NOTES: The various computational kernels are run separately; see
https://www.nas.nasa.gov/publications/npb.html for the list of tasks (IS =
Integer Sort, EP = Embarrassingly Parallel, etc)
MMTESTS CONFIG: global-dhp__nas-c-class-omp-full
MEASURES: time (seconds)
LOWER is better

                4.16.0               4.16.0
               vanilla            hwp-boost
Amean   bt.C    169.82 (  0.00%)     170.54 ( -0.42%)
Stddev  bt.C      1.07 (  0.00%)       0.97 (  9.34%)
Amean   cg.C     41.81 (  0.00%)      42.08 ( -0.65%)
Stddev  cg.C      0.06 (  0.00%)       0.03 ( 48.24%)
Amean   ep.C     26.63 (  0.00%)      26.47 (  0.61%)
Stddev  ep.C      0.37 (  0.00%)       0.24 ( 35.35%)
Amean   ft.C     38.17 (  0.00%)      38.41 ( -0.64%)
Stddev  ft.C      0.33 (  0.00%)       0.32 (  3.78%)
Amean   is.C      1.49 (  0.00%)       1.40 (  6.02%)
Stddev  is.C      0.20 (  0.00%)       0.16 ( 19.40%)
Amean   lu.C    217.46 (  0.00%)     220.21 ( -1.26%)
Stddev  lu.C      0.23 (  0.00%)       0.22 (  0.74%)
Amean   mg.C     18.56 (  0.00%)      18.80 ( -1.31%)
Stddev  mg.C      0.01 (  0.00%)       0.01 ( 22.54%)
Amean   sp.C    293.25 (  0.00%)     296.73 ( -1.19%)
Stddev  sp.C      0.10 (  0.00%)       0.06 ( 42.67%)
Amean   ua.C    170.74 (  0.00%)     172.02 ( -0.75%)
Stddev  ua.C      0.28 (  0.00%)       0.31 (-12.89%)

HOW TO REPRODUCE
================

To install MMTests, clone the git repo at
https://github.com/gormanm/mmtests.git

To run a config (ie a set of benchmarks, such as
config-global-dhp__nas-c-class-omp-full), use the command
    ./run-mmtests.sh --config configs/$CONFIG $MNEMONIC-NAME
from the top-level directory; the benchmark source will be downloaded from its
canonical internet location, compiled and run.

To compare results from two runs, use
    ./bin/compare-mmtests.pl --directory ./work/log \
                             --benchmark $BENCHMARK-NAME \
                             --names $MNEMONIC-NAME-1,$MNEMONIC-NAME-2
from the top-level directory.
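
If you'd rather script the two steps above, something along these lines
should do. It is only a sketch that assembles the same command lines shown
above; the helper names are made up, the run names "vanilla" and "hwp-boost"
are assumed from the table headers, and the benchmark name is left as a
placeholder.

```python
import shlex

def run_command(config, run_name):
    """argv for running one MMTests config under a given mnemonic name."""
    return ["./run-mmtests.sh", "--config", f"configs/{config}", run_name]

def compare_command(benchmark, run_names, log_dir="./work/log"):
    """argv for comparing named runs of one benchmark."""
    return [
        "./bin/compare-mmtests.pl",
        "--directory", log_dir,
        "--benchmark", benchmark,
        "--names", ",".join(run_names),
    ]

if __name__ == "__main__":
    config = "config-global-dhp__nas-c-class-omp-full"
    runs = ["vanilla", "hwp-boost"]  # assumed mnemonic names
    for name in runs:
        print(shlex.join(run_command(config, name)))
    # "BENCHMARK-NAME" is a placeholder, fill in the real benchmark name
    print(shlex.join(compare_command("BENCHMARK-NAME", runs)))
```

The commands are only printed here; run them by hand or pass the argv lists
to subprocess.run() from the top-level MMTests directory, booting the
matching kernel before each run.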



Thanks,
Giovanni Gherdovich
SUSE Labs