Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

From: Raghavendra K T
Date: Thu Sep 27 2012 - 06:25:25 EST


On 09/26/2012 06:27 PM, Andrew Jones wrote:
On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote:
On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
In some special scenarios, such as #vcpu <= #pcpu, the PLE handler may
prove very costly, because there is no need to iterate over the vcpus
and burn CPU on unsuccessful yield_to() attempts.

What's the costly thing? The vm-exit, the yield (which should be a nop
if it's the only task there) or something else entirely?

Both the vmexit and yield_to(), actually, because an unsuccessful
yield_to() is costly overall in the PLE handler.

This is because when we have large guests, say 32/16 vcpus, with one
vcpu holding the lock and the rest waiting for it, then on each
PLE-exit every spinning vcpu iterates over the rest of the vcpu list
in the VM and attempts a directed yield, unsuccessfully (O(n^2) tries
in total).

This results in a fairly high amount of CPU burning and double
runqueue lock contention.

(Had they simply kept spinning, lock progress would probably have been
faster.)
As Avi/Chegu Vinod felt, it would be better to avoid the vmexit
itself, but that currently seems a little complex to achieve.
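
For reference, a simplified sketch of the loop in question (modeled on
kvm_vcpu_on_spin() in virt/kvm/kvm_main.c, with the two-pass
round-robin bookkeeping elided):

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int i;

	/* Every vcpu that PLE-exits walks the whole vcpu list, so n
	 * spinners make O(n^2) yield_to() attempts between them.
	 */
	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (vcpu == me)
			continue;
		if (waitqueue_active(&vcpu->wq))
			continue;	/* halted, not spinning */
		/* yield_to() takes both runqueue locks and usually
		 * fails here, which is the wasted work described above.
		 */
		if (kvm_vcpu_yield_to(vcpu))
			break;
	}
}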

OK, so the vmexit stays and we need to improve yield_to.

Can't we do this check sooner as well, as it only requires per-cpu data?
If we do it way back in kvm_vcpu_on_spin, then we avoid get_pid_task()
and a bunch of read barriers from kvm_for_each_vcpu. Also, moving the test
into kvm code would allow us to do other kvm things as a result of the
check in order to avoid some vmexits. It looks like we should be able to
avoid some without much complexity by just making a per-vm ple_window
variable, and then, when we hit the nr_running == 1 condition, also doing
vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP)).
Reset the window to the default value when we successfully yield (and
maybe we should limit the number of bumps).
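
(Purely illustrative sketch of the above; kvm->ple_window,
PLE_WINDOW_BUMP, PLE_WINDOW_MAX and the helper names are made up for
the example, while vmcs_write32() and the PLE_WINDOW VMCS field are
the real VMX interfaces:)

/* Grow the window when the nr_running == 1 condition was hit:
 * spinning is cheap then, PLE exits are not.
 */
#define PLE_WINDOW_DEFAULT	4096	/* current KVM default */
#define PLE_WINDOW_BUMP		1024	/* made-up increment */
#define PLE_WINDOW_MAX		(16 * PLE_WINDOW_DEFAULT)

static void grow_ple_window(struct kvm *kvm)
{
	if (kvm->ple_window < PLE_WINDOW_MAX)	/* limit the bumps */
		kvm->ple_window += PLE_WINDOW_BUMP;
	vmcs_write32(PLE_WINDOW, kvm->ple_window);
}

/* Reset to the default on a successful yield. */
static void reset_ple_window(struct kvm *kvm)
{
	kvm->ple_window = PLE_WINDOW_DEFAULT;
	vmcs_write32(PLE_WINDOW, kvm->ple_window);
}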

We did indeed check early in the original undercommit patch, and it
gave results close to the PLE-disabled case. But I agree with Peter
that it is ugly to export nr_running info to the PLE handler.
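
For concreteness, the early check amounts to something like this
(this_cpu_nr_running() is a hypothetical helper standing in for the
rq->nr_running export that makes it ugly):

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	/* Undercommit: nothing else runs on this cpu, so a directed
	 * yield cannot help; just go back to spinning in the guest.
	 */
	if (this_cpu_nr_running() <= 1)
		return;

	/* ... existing directed-yield loop ... */
}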

Looking at the results and comparing A and C:
Base = 3.6.0-rc5 + ple handler optimization patches
A = Base + checking rq_running in vcpu_on_spin() patch
B = Base + checking rq->nr_running in sched/core
C = Base - PLE

% improvements w.r.t. Base
---+------------+------------+------------+
   |     A      |     B      |     C      |
---+------------+------------+------------+
1x | 206.37603  | 139.70410  | 210.19323  |

I have a feeling that the vmexit itself does not cause significant
overhead compared to iterating over the vcpus in the PLE handler.
Does that not seem so?

But

vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP))

is worth trying. I will have to look into it eventually.




