Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

From: Raghavendra K T
Date: Fri Jun 22 2012 - 17:02:35 EST


On 06/22/2012 08:41 PM, Andrew Jones wrote:
On Thu, Jun 21, 2012 at 04:56:08PM +0530, Raghavendra K T wrote:
Here are the results from kernbench.

PS: I think we should only take away that both patches perform better,
rather than reading into the actual numbers, since I am seeing more
variance, especially at 3x. Maybe I can test with a more stable
benchmark if somebody points


Hi Raghu,


First of all, thank you for your tests and for raising valid points.
It also opens an avenue for discussing all the different experiments
done over the past month (apart from tuning/benchmarking), which may
bring more feedback and valuable ideas from the community to optimize
performance further.

I shall discuss in reply to this mail separately.

I wonder if we should back up and try to determine the best
benchmark/test environment first.

I agree, we have to be able to produce similar results independently.
So far sysbench (and even pgbench) has been consistent. I am currently
checking whether other benchmarks like hackbench (modified #loops) and
ebizzy/dbench have low variance.

[ but they too are dependent on #client/threads etc ]

I think kernbench is good, but

Yes, kernbench at least helped me tune SPIN_THRESHOLD to a good extent.
But Jeremy had also pointed out that kernbench is a little inconsistent.

I wonder about how to simulate the overcommit, and to what degree
(1x, 3x, ??). What are you currently running to simulate overcommit
now? Originally we were running kernbench in one VM and cpu hogs
(bash infinite loops) in other VMs. Then we added vcpus and infinite
loops to get up to the desired overcommit. I saw later that you've
experimented with running kernbench in the other VMs as well, rather
than cpu hogs. Is that still the case?


Yes, I am now running the same benchmark on all the guests.

On non-PLE machines, "while 1" cpu hogs played a good role in simulating
LHP (lock-holder preemption), but on a PLE machine that did not seem to
be the case.

I started playing with benchmarking these proposals myself, but so
far have stuck to the cpu hog, since I wanted to keep variability
limited. However, when targeting a reasonable host loadavg with a
bunch of cpu hog vcpus, it limits the overcommit too much. I certainly
haven't tried 3x this way. So I'm inclined to throw out the cpu hog
approach as well. The question is, what to replace it with? It appears
that the performance of the PLE and pvticketlock proposals is quite
dependent on the level of overcommit, so we should choose a target
overcommit level and also a constraint on the host loadavg first,
then determine how to set up a test environment that fits it and yields
results with low variance.

Here are results from my 1.125x overcommit test environment using
cpu hogs.

At first the results seemed backward, but after looking at the
individual runs and their variation, I believe that, except for
rand_start, all the results should converge to zero difference. So if
we run the same tests again we may get completely different results.

IMO, on a 64 vcpu guest, running -j16 may not represent 1x load, so I
believe it has produced more of an under-commit/nearly 1x commit
result. Maybe we should try at least #threads = #vcpu or 2*#vcpu.


kcbench (a.k.a kernbench) results; 'mean-time (stddev)'
base-noPLE: 235.730 (25.932)
base-PLE: 238.820 (11.199)
rand_start-PLE: 283.193 (23.262)

The problem, as we currently know, is that in the PLE handler we may
end up choosing the same VCPU that was in the spin loop, which
unfortunately results in more cpu burning.

And by randomizing start_vcpu, we make that more probable.
We need logic that does not choose a vcpu that has recently PL-exited,
since it cannot be a lock holder. The next eligible lock holder can
then be picked up easily with the PV patches.
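
To make the idea above concrete, here is a toy Python model of the
candidate selection. It is not kernel code and is only a sketch: the
names pick_yield_target and recently_ple_exited are made up for
illustration, and the round-robin scan starting after the last boosted
vcpu is an assumption about how the search is ordered.

```python
def pick_yield_target(n_vcpus, me, last_boosted, recently_ple_exited):
    """Toy model: choose a vcpu to boost from a PLE handler.

    n_vcpus             -- number of vcpus in the guest
    me                  -- index of the vcpu that just PLE-exited
    last_boosted        -- index of the vcpu boosted last time
    recently_ple_exited -- set of vcpu indices that recently PLE-exited
                           (hypothetical bookkeeping, assumed to exist)

    Returns the chosen vcpu index, or None if there is no candidate.
    """
    for offset in range(1, n_vcpus + 1):
        candidate = (last_boosted + offset) % n_vcpus
        if candidate == me:
            # Never yield to ourselves.
            continue
        if candidate in recently_ple_exited:
            # A vcpu that was itself spinning cannot hold the lock.
            continue
        return candidate
    return None


# vcpu 0 is spinning, vcpu 1 was also spinning recently: pick vcpu 2.
print(pick_yield_target(4, me=0, last_boosted=0, recently_ple_exited={1}))
```

The point of the extra check is only to lower the chance of boosting
another spinner; how "recently" is tracked is left open here.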

pvticketlocks-noPLE: 244.987 (7.562)
pvticketlocks-PLE: 247.597 (17.200)

base kernel: 3.5.0-rc3 + Rik's new last_boosted patch
rand_start kernel: 3.5.0-rc3 + Raghu's proposed random start patch
pvticketlocks kernel: 3.5.0-rc3 + Rik's new last_boosted patch
+ Raghu's pvticketlock series

OK, I believe SPIN_THRESHOLD was 2k, right? What I had observed is that
with a 2k threshold we see halt exit overheads. Currently I am mostly
trying 4k.


The relative standard deviations are as high as 11%. So I'm not
real pleased with the results, and they show degradation everywhere.
Below are the details of the benchmarking. Everything is there except
the kernel config, but our benchmarking should be reproducible with
nearly random configs anyway.
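
For reference, the 11% figure and the "degradation everywhere" remark
can be checked from the summary numbers with a few lines of Python.
This is only a sketch: taking base-noPLE as the baseline for the
slowdown column is an assumption, and the helper names are made up.

```python
# (mean elapsed time, stddev) per configuration, from the summary above.
results = {
    "base-noPLE": (235.730, 25.932),
    "base-PLE": (238.820, 11.199),
    "rand_start-PLE": (283.193, 23.262),
    "pvticketlocks-noPLE": (244.987, 7.562),
    "pvticketlocks-PLE": (247.597, 17.200),
}


def relative_stddev(mean, stddev):
    """Relative standard deviation in percent."""
    return 100.0 * stddev / mean


def slowdown_vs(baseline_mean, mean):
    """Percent increase in mean time over the baseline (higher is worse)."""
    return 100.0 * (mean - baseline_mean) / baseline_mean


# Assumption: base-noPLE is the reference configuration.
baseline_mean = results["base-noPLE"][0]
for name, (mean, stddev) in results.items():
    print(f"{name:22s} rsd={relative_stddev(mean, stddev):5.1f}%  "
          f"vs base-noPLE={slowdown_vs(baseline_mean, mean):+5.1f}%")
```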

Drew

= host =
- Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
- 64 cpus, 4 nodes, 64G mem
- Fedora 17 with test kernels (see tests)

= benchmark =
- one cpu hog F17 VM
  - 64 vcpus, 8G mem
  - all vcpus run a bash infinite loop
  - kernel: 3.5.0-rc3
- one kcbench (a.k.a kernbench) F17 VM
  - 8 vcpus, 8G mem
  - 'kcbench -d /mnt/ram', /mnt/ram is 1G ramfs

Maybe we have to check whether 1GB of RAM is OK when we have 128
threads; not sure.

  - kcbench-0.3-8.1.noarch, kcbench-data-2.6.38-0.1-9.fc17.noarch,
    kcbench-data-0.1-9.fc17.noarch
  - gcc (GCC) 4.7.0 20120507 (Red Hat 4.7.0-5)
  - kernel: same test kernel as host
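
The 1.125x overcommit figure follows from the setup above: 64 + 8 guest
vcpus on 64 host cpus. A minimal sketch of that arithmetic, with a
made-up helper name, assuming overcommit is simply total guest vcpus
divided by host cpus:

```python
def overcommit_ratio(guest_vcpus, host_cpus):
    """Total guest vcpus divided by host cpus (assumed definition)."""
    return sum(guest_vcpus) / host_cpus


# One 64-vcpu cpu hog VM plus one 8-vcpu kcbench VM on a 64-cpu host.
ratio = overcommit_ratio([64, 8], host_cpus=64)
print(ratio)  # 1.125
```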

= test 1: base, PLE disabled (ple_gap=0) =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch

Run 1 (-j 16): 4211 (e:237.43 P:637% U:697.98 S:815.46 F:0)
Run 2 (-j 16): 3834 (e:260.77 P:631% U:729.69 S:917.56 F:0)
Run 3 (-j 16): 4784 (e:208.99 P:644% U:638.17 S:708.63 F:0)

mean: 235.730 stddev: 25.932

= test 2: base, PLE enabled =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch

Run 1 (-j 16): 4335 (e:230.67 P:639% U:657.74 S:818.28 F:0)
Run 2 (-j 16): 4269 (e:234.20 P:647% U:743.43 S:772.52 F:0)
Run 3 (-j 16): 3974 (e:251.59 P:639% U:724.29 S:884.21 F:0)

mean: 238.820 stddev: 11.199

= test 3: rand_start, PLE enabled =
- kernel: 3.5.0-rc3 + Raghu's random start patch

Run 1 (-j 16): 3898 (e:256.52 P:639% U:756.14 S:884.63 F:0)
Run 2 (-j 16): 3341 (e:299.27 P:633% U:857.49 S:1039.62 F:0)
Run 3 (-j 16): 3403 (e:293.79 P:635% U:857.21 S:1008.83 F:0)

mean: 283.193 stddev: 23.262

= test 4: pvticketlocks, PLE disabled (ple_gap=0) =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series
+ PARAVIRT_SPINLOCKS=y config change

Run 1 (-j 16): 3963 (e:252.29 P:647% U:736.43 S:897.16 F:0)
Run 2 (-j 16): 4216 (e:237.19 P:650% U:706.68 S:837.42 F:0)
Run 3 (-j 16): 4073 (e:245.48 P:649% U:709.46 S:884.68 F:0)

mean: 244.987 stddev: 7.562

= test 5: pvticketlocks, PLE enabled =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series
+ PARAVIRT_SPINLOCKS=y config change

Run 1 (-j 16): 3978 (e:251.32 P:629% U:758.86 S:824.29 F:0)
Run 2 (-j 16): 4369 (e:228.84 P:634% U:708.32 S:743.71 F:0)
Run 3 (-j 16): 3807 (e:262.63 P:626% U:767.03 S:877.96 F:0)

mean: 247.597 stddev: 17.200
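
The mean and stddev lines in the five tests can be reproduced from the
e: value of each run (which appears to be the elapsed time). A short
Python sketch follows; the reported stddev values match the sample
standard deviation (n-1 denominator), which is what statistics.stdev
computes. The dictionary keys are just labels chosen here.

```python
import statistics

# e: values of the three runs of each test, copied from above.
elapsed = {
    "base-noPLE": [237.43, 260.77, 208.99],
    "base-PLE": [230.67, 234.20, 251.59],
    "rand_start-PLE": [256.52, 299.27, 293.79],
    "pvticketlocks-noPLE": [252.29, 237.19, 245.48],
    "pvticketlocks-PLE": [251.32, 228.84, 262.63],
}


def summarize(times):
    """Return (mean, sample standard deviation) of a list of run times."""
    return statistics.mean(times), statistics.stdev(times)


for name, times in elapsed.items():
    mean, stddev = summarize(times)
    print(f"{name}: mean: {mean:.3f} stddev: {stddev:.3f}")
```

With only three runs per test the stddev estimate is itself quite
noisy, which fits the variance concerns raised in this thread.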



OK, in summary:
can we agree that, for kernbench, 1x = -j (2*#vcpu) in 1 VM,
1.5x = -j (2*#vcpu) in 1 VM and -j (#vcpu) in the other, and so on?
And also on a SPIN_THRESHOLD of 4k?
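
As a sketch of that convention, the -j values per VM could be derived
like this. Only the 1x and 1.5x cases come from the proposal above;
extending it so that every further 1x adds a VM with -j (2*#vcpu) is an
assumption about what "and so on" means, and the function name is made
up.

```python
def kernbench_jobs(overcommit, vcpus_per_vm):
    """Return the kernbench -j value for each VM at a given overcommit.

    Supports multiples of 0.5 only. Each full 1x is one VM running
    -j (2*#vcpu); a trailing 0.5x is one more VM running -j (#vcpu).
    Levels beyond 1.5x are an extrapolation, not part of the proposal.
    """
    doubled = overcommit * 2
    if overcommit < 1 or doubled != int(doubled):
        raise ValueError("overcommit must be >= 1 and a multiple of 0.5")
    full_vms, half = divmod(int(doubled), 2)
    jobs = [2 * vcpus_per_vm] * full_vms
    if half:
        jobs.append(vcpus_per_vm)
    return jobs


print(kernbench_jobs(1.0, 8))  # [16]
print(kernbench_jobs(1.5, 8))  # [16, 8]
```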

Any ideas on benchmarks are welcome from all.

- Raghu
