Re: [RFC PATCH v3 00/16] Core scheduling v3

From: Dario Faggioli
Date: Tue Oct 29 2019 - 05:11:29 EST


On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> I'm using the following branch as base which is v5.1.5 based:
> https://github.com/digitalocean/linux-coresched coresched-v3-v5.1.5-
> test
>
> And I have pushed Tim's branch to:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
>
> Mine:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> core_vruntime
>
Hello,

As anticipated, I've been trying to follow the development of this
feature and, in the meantime, I have done some benchmarks.

I actually have a lot of data (and am planning for more), so I am
sending a few emails, each one with a subset of the numbers in it,
instead of just one, which would be beyond giant! :-)

I'll put, in this first one, some background and some common
information, e.g., about the benchmarking platform and configurations,
and about how to read and interpret the data that will follow.

It's quite hard to come up with a concise summary, and sometimes it's
even tricky to identify consolidated trends. There are also things that
look weird and, although I double checked my methodology, I can't
exclude that glitches or errors may have occurred. For each of the
benchmarks, I have at least some information about what the
configuration was when it was run, and also some monitoring and perf
data. So, if interested, just ask and we'll see what we can dig out.

And in any case, I have the procedure for running these benchmarks
fairly decently (although not completely) automated. So if we see
things that look really, really weird, I can rerun (perhaps with a
different configuration, more monitoring, etc.).

For each benchmark, I'll "dump" the results, with just some comments
about the things that I find most relevant/interesting. Then, if we
want, we can look at them and analyze them together.
For each experiment, I do have some limited amount of tracing and
debugging information still available, in case it could be useful. And,
as said, I can always rerun.

I can also provide, quite easily, different-looking tables, e.g., with
different sets of columns, different baselines, etc. Just ask for
whatever you think would be the most interesting to see and, most
likely, it will be possible to do it.

Oh, and I'll upload text files whose contents are identical to these
emails, in this space:

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/

In case tables are rendered better in a browser than in a MUA.

Thanks and Regards,
Dario
---

Code:
1) Linux 5.1.5 (commit id 835365932f0dc25468840753e071c05ad6abc76f)
2) https://github.com/digitalocean/linux-coresched/tree/vpillai/coresched-v3-v5.1.5-test
3) https://github.com/aaronlu/linux/tree/coresched-v3-v5.1.5-test-core_vruntime
4) https://github.com/aaronlu/linux/tree/coresched-v3-v5.1.5-test-tim

Benchmarking suite:
- MMTests: https://github.com/gormanm/mmtests
- Tweaked to deal with running benchmarks in VMs. Still working on
upstreaming that to Mel (a WIP is available here:
https://github.com/dfaggioli/mmtests/tree/bench-virt )
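
For reference, driving a run with MMTests boils down to picking a config
file and launching the top-level script, roughly like this (the config
name below is just a placeholder; the actual, tweaked configs I used are
in the bench-virt branch above):

  git clone -b bench-virt https://github.com/dfaggioli/mmtests.git
  cd mmtests
  # <config> is one of the files under configs/, e.g., the STREAM one
  ./run-mmtests.sh --config configs/<config> coresched-run-1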

Benchmarking host:
- CPU: 1 socket, 4 cores, 2 threads per core
- RAM: 32 GB
- distro: openSUSE Tumbleweed
- HW Bugs Mitigations: fully disabled
- Filesys: XFS

VMs:
- vCPUs: either 8 or 4
- distro: openSUSE Tumbleweed
- Kernel: 5.1.16
- HW Bugs Mitigations: fully disabled
- Filesys: XFS
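
For completeness, this is, roughly, what the noHT and "mitigations fully
disabled" configurations mean in practice (a sketch of the standard
knobs, not the literal command lines from my scripts):

  # *-noHT runs: turn SMT off at runtime, on the host
  echo off > /sys/devices/system/cpu/smt/control
  # HW bugs mitigations: disabled on both host and guest kernels via the
  # boot command line, e.g.:
  #   mitigations=off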

Benchmarks:
- STREAM : pure memory benchmark (various kinds of mem-ops done
in parallel). Parallelism is NR_CPUS/2 tasks
- Kernbench : builds a kernel, with varying number of compile
jobs. HT is, in general, known to help, as it lets
us do "more parallel" builds
- Hackbench : communication (via pipes, in this case) between
groups of processes. As we deal with _groups_ of
tasks, we're already at saturation with 1 group,
hence we expect the HyperThreading-disabled
configurations to suffer
- mutilate : load generator for memcached, with high request
rate;
- netperf-unix : two communicating tasks. Without any pinning
(neither at the host nor at the guest level), we
expect HT to play a role. In fact, depending on
where the two tasks are scheduled (i.e., whether on
two threads of the same core, or not) performance may
vary
- sysbenchcpu : the process-based CPU stressing workload of sysbench
- sysbenchthread : the thread-based CPU stressing workload of sysbench
- sysbench : the database workload

This is kind of a legend for the columns you will see in the tables.

- v-* : vanilla, i.e., benchmarks were run on code _without_ any
core-scheduling patch applied (see 1 in 'Code' section above)
- *BM-* : baremetal, i.e., benchmarks were run on the host, without
any VM running or anything
- *VM-* : Virtual Machine, i.e., benchmarks were run inside a VM, with
the following characteristics:
- *VM- : benchmarks were run in a VM with 8 vCPUs. That was the
only VM running in the system
- *VM-v4 : benchmarks were run in a VM with 4 vCPUs. That was the
only VM running in the system
- *VMx2 : benchmarks were run in a VM with 8 vCPUs, and there was
another VM running, also with 8 vCPUs, generating about 600% of
CPU, memory and IO stress load (see the stress-ng sketch right
after this list)
- *-csc-* : benchmarks were run with Core scheduling v3 patch
series (see 2 in 'Code' section above)
- *-csc_stallfix-* : benchmarks were run with Core scheduling v3 and
the 'stallfix' feature enabled
- *-csc_vruntime-* : benchmarks were run with Core scheduling v3 + the
vruntime patches (see 3 in 'Code' section above)
- *-csc_tim-* : benchmarks were run with Core scheduling v3 +
Tim's patches (see 4 in 'Code' section above)
- *-noHT : benchmarks were run with HyperThreading disabled
- *-HT : benchmarks were run with HyperThreading enabled
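
As a reference, the ~600% background load of the *VMx2 cases is
generated inside the second (untagged) VM with stress-ng. I'm not
quoting the exact invocation here; a sketch that keeps roughly six
workers always busy (hence ~600% worth of load), spread over CPU,
memory and IO, would be something like:

  # illustrative only: 2 CPU hogs, 2 memory stressors, 2 disk/IO stressors
  stress-ng --cpu 2 --vm 2 --hdd 2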

So, for instance, the column BM-noHT shows data from a run done on
baremetal, with HyperThreading disabled. The column v-VM-HT shows data
from a run done in an 8 vCPUs VM, with HyperThreading enabled, and no
core-scheduling patches applied. The column VM-csc_vruntime-HT shows
data from a run done in an 8 vCPUs VM with the core-scheduling v3
patches + the vruntime patches applied. The column VM-v4-HT shows data
from a run done in a 4 vCPUs VM, with core-scheduling patches applied
but not used (the vCPUs of the VM weren't tagged). The column
VMx2-csc_vruntime-HT shows data from a run done in an 8 vCPUs VM, with
core-scheduling v3 + the vruntime patches applied and the vCPUs of the
VM tagged, while there was another (untagged) VM in the system,
introducing ~600% load (CPU, memory and IO, via stress-ng). Etc.
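
(A note on "tagging": with the v3 patches, tagging is done via the CPU
cgroup controller which, if I recall the interface correctly, exposes a
cpu.tag file. Just as an illustration, and assuming a cgroup-v1
hierarchy plus a QEMU process called qemu-system-x86_64 (both
assumptions, not necessarily my exact setup), tagging a VM's vCPU
threads looks roughly like this:

  # create a group for the VM and mark it as a core-scheduling group
  mkdir /sys/fs/cgroup/cpu/vm1
  echo 1 > /sys/fs/cgroup/cpu/vm1/cpu.tag
  # move all the threads of the VM (vCPUs included) into the tagged group
  for tid in $(ls /proc/$(pidof qemu-system-x86_64)/task); do
      echo $tid > /sys/fs/cgroup/cpu/vm1/tasks
  done

The "untagged" runs just skip this step, so core scheduling never
actually engages, even when the patches are applied.)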

See the 'Appendix' at the bottom of this email for a comprehensive
list of all the combinations (or, at least, I think it is
comprehensive... I hope I haven't missed any :-) ).

In all tables, percent increases and decreases are always relative to
the first column. Whether lower or higher values are better is already
taken care of.
Basically, when we see -x.yz%, it always means performance is worse
than the baseline, and the absolute value of that (i.e., x.yz) tells
you by how much.
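For instance, taking the 'MB/sec copy' row of the overloaded-VM (VMx2)
table below: 29143.88 against a baseline of 33514.96 gives
(29143.88 - 33514.96) / 33514.96 * 100 = -13.04%, i.e., a 13.04%
regression.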

If, for instance, we want to compare HT and non-HT on baremetal, we
check the BM-HT and BM-noHT columns.
If we want to compare v3 + vruntime patches against no HyperThreading,
when the system is overloaded, we look at the VMx2-noHT and VMx2-
csc_vruntime-HT columns and check by how much they deviate from the
baseline (i.e., which one regresses more). For comparing the various
core scheduling solutions, we can check by how much each one is either
better or worse than the baseline. And so on...

The most relevant comparisons, IMO, are:
- the various core scheduling solutions against their respective HT
baseline. This, in fact, tells us what people will experience if they
start using core scheduling on these workloads
- the various core scheduling solutions against their respective noHT
baseline. This, in fact, tells us whether core scheduling is
effective, for the given workload, or if it would just be better to
disable HyperThreading
- the overhead introduced by the core scheduling patches, when they are
not used (i.e., v-BM-HT against BM-HT, or v-VM-HT against VM-HT). This,
in fact, tells us what happens to *everyone*, including the ones that
do not want core scheduling and will keep it disabled, if we merge it

Note that the overhead, so far, has been evaluated only for the -csc
case, i.e., when the patches from point 2 in 'Code' above are applied,
but tasks/vCPUs are not tagged, and hence core scheduling is not really
used.

Anyway, let's get to the point where I give you some data already! :-D
:-D

STREAM
======

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-1_stream.txt

Columns (left to right): v-BM-HT, v-BM-noHT, BM-HT, BM-noHT, BM-csc-HT, BM-csc_stallfix-HT, BM-csc_tim-HT, BM-csc_vruntime-HT
MB/sec copy 33827.50 ( 0.00%) 33654.32 ( -0.51%) 33683.34 ( -0.43%) 33819.30 ( -0.02%) 33830.88 ( 0.01%) 33731.02 ( -0.29%) 33573.76 ( -0.75%) 33292.76 ( -1.58%)
MB/sec scale 22762.02 ( 0.00%) 22524.00 ( -1.05%) 22416.54 ( -1.52%) 22444.16 ( -1.40%) 22652.56 ( -0.48%) 22462.80 ( -1.31%) 22461.90 ( -1.32%) 22670.84 ( -0.40%)
MB/sec add 26141.76 ( 0.00%) 26241.42 ( 0.38%) 26559.40 ( 1.60%) 26365.36 ( 0.86%) 26607.10 ( 1.78%) 26384.50 ( 0.93%) 26117.78 ( -0.09%) 26192.12 ( 0.19%)
MB/sec triad 26522.46 ( 0.00%) 26555.26 ( 0.12%) 26499.62 ( -0.09%) 26373.26 ( -0.56%) 26667.32 ( 0.55%) 26642.70 ( 0.45%) 26505.38 ( -0.06%) 26409.60 ( -0.43%)
Columns (left to right): v-VM-HT, v-VM-noHT, VM-HT, VM-noHT, VM-csc-HT, VM-csc_stallfix-HT, VM-csc_tim-HT, VM-csc_vruntime-HT
MB/sec copy 34559.32 ( 0.00%) 34153.30 ( -1.17%) 34236.64 ( -0.93%) 33724.38 ( -2.42%) 33535.60 ( -2.96%) 33534.10 ( -2.97%) 33469.70 ( -3.15%) 33873.18 ( -1.99%)
MB/sec scale 22556.18 ( 0.00%) 22834.88 ( 1.24%) 22733.12 ( 0.78%) 23010.46 ( 2.01%) 22480.60 ( -0.34%) 22552.94 ( -0.01%) 22756.50 ( 0.89%) 22434.96 ( -0.54%)
MB/sec add 26209.70 ( 0.00%) 26640.08 ( 1.64%) 26692.54 ( 1.84%) 26747.40 ( 2.05%) 26358.20 ( 0.57%) 26353.50 ( 0.55%) 26686.62 ( 1.82%) 26256.50 ( 0.18%)
MB/sec triad 26521.80 ( 0.00%) 26490.26 ( -0.12%) 26598.66 ( 0.29%) 26466.30 ( -0.21%) 26560.48 ( 0.15%) 26496.30 ( -0.10%) 26609.10 ( 0.33%) 26450.68 ( -0.27%)
Columns (left to right): v-VM-v4-HT, v-VM-v4-noHT, VM-v4-HT, VM-v4-noHT, VM-v4-csc-HT, VM-v4-csc_stallfix-HT, VM-v4-csc_tim-HT, VM-v4-csc_vruntime-HT
MB/sec copy 32257.48 ( 0.00%) 32504.18 ( 0.76%) 32375.66 ( 0.37%) 32261.98 ( 0.01%) 31940.84 ( -0.98%) 32070.88 ( -0.58%) 31926.80 ( -1.03%) 31882.18 ( -1.16%)
MB/sec scale 19806.46 ( 0.00%) 20281.18 ( 2.40%) 20266.80 ( 2.32%) 20075.46 ( 1.36%) 19847.66 ( 0.21%) 20119.00 ( 1.58%) 19899.84 ( 0.47%) 20060.48 ( 1.28%)
MB/sec add 22178.58 ( 0.00%) 22426.92 ( 1.12%) 22185.54 ( 0.03%) 22153.52 ( -0.11%) 21975.80 ( -0.91%) 22097.72 ( -0.36%) 21827.66 ( -1.58%) 22068.04 ( -0.50%)
MB/sec triad 22149.10 ( 0.00%) 22200.54 ( 0.23%) 22142.10 ( -0.03%) 21933.04 ( -0.98%) 21898.50 ( -1.13%) 22160.64 ( 0.05%) 22003.40 ( -0.66%) 21951.16 ( -0.89%)
Columns (left to right): v-VMx2-HT, v-VMx2-noHT, VMx2-HT, VMx2-noHT, VMx2-csc-HT, VMx2-csc_stallfix-HT, VMx2-csc_tim-HT, VMx2-csc_vruntime-HT
MB/sec copy 33514.96 ( 0.00%) 24740.70 ( -26.18%) 30410.96 ( -9.26%) 22157.24 ( -33.89%) 29552.60 ( -11.82%) 29374.78 ( -12.35%) 28717.38 ( -14.31%) 29143.88 ( -13.04%)
MB/sec scale 22605.74 ( 0.00%) 15473.56 ( -31.55%) 19051.76 ( -15.72%) 15278.64 ( -32.41%) 19246.98 ( -14.86%) 19081.04 ( -15.59%) 18747.60 ( -17.07%) 18776.02 ( -16.94%)
MB/sec add 26249.56 ( 0.00%) 18559.92 ( -29.29%) 21143.90 ( -19.45%) 18664.30 ( -28.90%) 21236.00 ( -19.10%) 21067.40 ( -19.74%) 20878.78 ( -20.46%) 21266.92 ( -18.98%)
MB/sec triad 26290.16 ( 0.00%) 19274.10 ( -26.69%) 20573.62 ( -21.74%) 17631.52 ( -32.93%) 21066.94 ( -19.87%) 20975.04 ( -20.22%) 20944.56 ( -20.33%) 20942.18 ( -20.34%)

So, STREAM, at least in this configuration, is not (as could have
been expected) really sensitive to HyperThreading. In fact, in most
cases, both when run on baremetal and in VMs, HT and noHT results are
pretty much the same. When core scheduling is used, things do not
look bad at all to me, as results are, most of the time, only
marginally worse.

Do check, however, the overloaded case. There, disabling HT has quite a
big impact, and core scheduling does a rather good job in restoring
good performance.

From the overhead point of view, the situation does not look too bad
either. In fact, in the first three groups of measurements, the overhead
introduced by having the core scheduling patches in is acceptable (there
are actually cases where they seem to do more good than harm! :-P).
However, when the system is overloaded, despite there not being any
tagged task, numbers look pretty bad. It seems that, for instance, of
the 13.04% performance drop between v-VMx2-HT and VMx2-csc_vruntime-HT,
9.26% comes from overhead (as that much is there already in VMx2-HT)!!

Something to investigate better, I guess...


Appendix

* v-BM-HT : no coresched patch applied, baremetal, HyperThreading enabled
* v-BM-noHT : no coresched patch applied, baremetal, Hyperthreading disabled
* v-VM-HT : no coresched patch applied, 8 vCPUs VM, HyperThreading enabled
* v-VM-noHT : no coresched patch applied, 8 vCPUs VM, Hyperthreading disabled
* v-VM-v4-HT : no coresched patch applied, 4 vCPUs VM, HyperThreading enabled
* v-VM-v4-noHT : no coresched patch applied, 4 vCPUs VM, Hyperthreading disabled
* v-VMx2-HT : no coresched patch applied, 8 vCPUs VM + 600% stress overhead, HyperThreading enabled
* v-VMx2-noHT : no coresched patch applied, 8 vCPUs VM + 600% stress overhead, Hyperthreading disabled

* BM-HT : baremetal, HyperThreading enabled
* BM-noHT : baremetal, Hyperthreading disabled
* BM-csc-HT : baremetal, coresched-v3 (Hyperthreading enabled, of course)
* BM-csc_stallfix-HT : baremetal, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* BM-csc_tim-HT : baremetal, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* BM-csc_vruntime-HT : baremetal, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VM-HT : 8 vCPUs VM, HyperThreading enabled
* VM-noHT : 8 vCPUs VM, Hyperthreading disabled
* VM-csc-HT : 8 vCPUs VM, coresched-v3 (Hyperthreading enabled, of course)
* VM-csc_stallfix-HT : 8 vCPUs VM, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VM-csc_tim-HT : 8 vCPUs VM, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VM-csc_vruntime-HT : 8 vCPUs VM, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VM-v4-HT : 4 vCPUs VM, HyperThreading enabled
* VM-v4-noHT : 4 vCPUs VM, Hyperthreading disabled
* VM-v4-csc-HT : 4 vCPUs VM, coresched-v3 (Hyperthreading enabled, of course)
* VM-v4-csc_stallfix-HT : 4 vCPUs VM, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VM-v4-csc_tim-HT : 4 vCPUs VM, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VM-v4-csc_vruntime-HT : 4 vCPUs VM, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VMx2-HT : 8 vCPUs VM + 600% stress overhead, HyperThreading enabled
* VMx2-noHT : 8 vCPUs VM + 600% stress overhead, Hyperthreading disabled
* VMx2-csc-HT : 8 vCPUs VM + 600% stress overhead, coresched-v3 (Hyperthreading enabled, of course)
* VMx2-csc_stallfix-HT : 8 vCPUs VM + 600% stress overhead, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VMx2-csc_tim-HT : 8 vCPUs VM + 600% stress overhead, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VMx2-csc_vruntime-HT : 8 vCPUs VM + 600% stress overhead, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
