Re: [PATCH v4 0/3] KVM: Dynamic Halt-Polling

From: Wanpeng Li
Date: Tue Sep 01 2015 - 20:29:58 EST


On 9/2/15 7:24 AM, David Matlack wrote:
On Tue, Sep 1, 2015 at 3:58 PM, Wanpeng Li <wanpeng.li@xxxxxxxxxxx> wrote:
On 9/2/15 6:34 AM, David Matlack wrote:
On Tue, Sep 1, 2015 at 3:30 PM, Wanpeng Li <wanpeng.li@xxxxxxxxxxx> wrote:
On 9/2/15 5:45 AM, David Matlack wrote:
On Thu, Aug 27, 2015 at 2:47 AM, Wanpeng Li <wanpeng.li@xxxxxxxxxxx> wrote:
v3 -> v4:
* bring back growing vcpu->halt_poll_ns when an interrupt arrives and
  shrinking it when an idle VCPU is detected

v2 -> v3:
* grow/shrink vcpu->halt_poll_ns by *halt_poll_ns_grow or
  /halt_poll_ns_shrink (see the sketch just after this list)
* drop the macros and hard-code the numbers in the param definitions
* update the comments "5-7 us"
* remove halt_poll_ns_max and use halt_poll_ns as the max halt_poll_ns
  time; vcpu->halt_poll_ns starts at zero
* drop the wrappers
* move the grow/shrink logic before "out:" w/ "if (waited)"
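
For illustration, a minimal sketch of such grow/shrink helpers, assuming
a 10 us starting window and the existing halt_poll_ns module parameter as
the cap (this is not the posted v4 code, just the shape of the idea):

/*
 * Sketch only (not the posted v4 patch): grow vcpu->halt_poll_ns
 * multiplicatively and shrink it by division, capped by the existing
 * halt_poll_ns module parameter.
 */
static unsigned int halt_poll_ns_grow = 2;      /* multiplier on grow */
static unsigned int halt_poll_ns_shrink = 2;    /* divisor on shrink */

static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
{
        unsigned int val = vcpu->halt_poll_ns;

        if (val == 0)
                val = 10000;                    /* assumed base case: 10 us */
        else
                val *= halt_poll_ns_grow;

        /* halt_poll_ns (the module parameter) serves as the upper bound */
        vcpu->halt_poll_ns = min(val, halt_poll_ns);
}

static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
{
        if (halt_poll_ns_shrink == 0)
                vcpu->halt_poll_ns = 0;
        else
                vcpu->halt_poll_ns /= halt_poll_ns_shrink;
}
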
I posted a patchset which adds dynamic poll toggling (an on/off switch). I
think this gives you a good place to build your dynamic growth patch on
top. The toggling patch has close to zero overhead for idle VMs and, for
VMs doing message passing, performance equivalent to always-poll. It's a
patch that's been in my queue for a few weeks but I just haven't had the
time to send out. We can win even more with your patchset by only polling
as much as we need (via dynamic growth/shrink). It also gives us a better
place to stand for choosing a default for halt_poll_ns. (We can run
experiments and see how high vcpu->halt_poll_ns tends to grow.)

The reason I posted a separate patch for toggling is that it adds timers
to kvm_vcpu_block and deals with a weird edge case (kvm_vcpu_block can get
called multiple times for one halt).

Why can this happen?
Ah, probably because I'm missing 9c8fd1ba220 (KVM: x86: optimize delivery
of TSC deadline timer interrupt). I don't think the edge case exists in
the latest kernel.

Yeah, I hope we both (including Peter Kieser) can test against the latest kvm tree to avoid confusion. The reason to introduce the adaptive halt-polling toggle is to handle the "edge case" you mentioned above, so I think we can put more effort into improving v4 instead. I will improve v4 to handle short halts today. ;-)



To do dynamic poll adjustment correctly, we have to time the length of
each halt. Otherwise we hit some bad edge cases:

v3: v3 had lots of idle overhead. It's because vcpu->halt_poll_ns grew
every time we had a long halt. So idle VMs looked like: 0 us -> 500 us ->
1 ms -> 2 ms -> 4 ms -> 0 us. Ideally vcpu->halt_poll_ns should just stay
at 0 when the halts are long.

v4: v4 fixed the idle overhead problem but broke dynamic growth for
message-passing VMs. Every time a VM did a short halt, vcpu->halt_poll_ns
would grow. That means vcpu->halt_poll_ns will always be maxed out, even
when the halt time is much less than the max.

I think we can fix both edge cases if we make grow/shrink decisions based
on the length of kvm_vcpu_block rather than on the arrival of a guest
interrupt during polling.
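
A minimal sketch of that idea, reusing the grow/shrink helpers sketched
above and assuming ktime-based timing around the poll/schedule path
(illustration only, not the code that was eventually merged):

/*
 * Sketch only: decide grow/shrink from the measured length of the whole
 * halt rather than from whether an interrupt arrived while polling.
 * halt_poll_ns is the module parameter acting as the cap.
 */
void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
        ktime_t start = ktime_get();
        u64 block_ns;

        /* ... poll for up to vcpu->halt_poll_ns, then schedule() if still idle ... */

        block_ns = ktime_to_ns(ktime_sub(ktime_get(), start));

        if (block_ns <= vcpu->halt_poll_ns)
                ;                               /* current window already covers this halt */
        else if (block_ns < halt_poll_ns)
                grow_halt_poll_ns(vcpu);        /* short halt: a larger window would have caught it */
        else
                shrink_halt_poll_ns(vcpu);      /* long (idle) halt: back off toward 0 */
}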

Some thoughts for dynamic growth:
* Given the Windows 10 timer tick (1 ms), let's set the maximum poll time
  to less than 1 ms. 200 us has been a good value for always-poll. We can
  probably go a bit higher once we have your patch. Maybe 500 us?

Did you test your patch against a Windows guest?

I have not. I tested against a 250 Hz Linux guest to check how it performs
against a ticking guest. Presumably, Windows should be the same, but at a
higher tick rate. Do you have a test for Windows?

I just tested the idle vCPU usage.


V4 for a Windows 10 guest (idle vCPU usage):

+-----------------+----------------+-----------------------+
|  w/o halt-poll  |  w/ halt-poll  | dynamic(v4) halt-poll |
+-----------------+----------------+-----------------------+
|      ~2.1%      |      ~3.0%     |         ~2.4%         |
+-----------------+----------------+-----------------------+

V4 for a Linux guest (idle vCPU usage):

+-----------------+----------------+-------------------+
|  w/o halt-poll  |  w/ halt-poll  | dynamic halt-poll |
+-----------------+----------------+-------------------+
|      ~0.9%      |      ~1.8%     |       ~1.2%       |
+-----------------+----------------+-------------------+


Regards,
Wanpeng Li


* The base case of dynamic growth (the first grow() after being at 0)
  should be small. 500 us is too big. When I run TCP_RR in my guest I see
  poll times of < 10 us. TCP_RR is on the lower end of message-passing
  workload latency, so 10 us would be a good base case.
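
In module-parameter terms, the numbers floated in this thread might look
like the fragment below; halt_poll_ns is the existing KVM parameter, while
halt_poll_ns_base is a hypothetical knob used here only to name the 10 us
base case:

/* Illustrative values only, following the numbers discussed above. */

/* Cap the poll window well below a 1 ms guest timer tick. */
static unsigned int halt_poll_ns = 500000;              /* 500 us */
module_param(halt_poll_ns, uint, 0644);

/*
 * Hypothetical knob for the first grow() step after being at 0,
 * sized for ~10 us TCP_RR-style wakeups.
 */
static unsigned int halt_poll_ns_base = 10000;          /* 10 us */
module_param(halt_poll_ns_base, uint, 0644);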

How do I get your TCP_RR benchmark?

Regards,
Wanpeng Li
Install the netperf package, or build from here:
http://www.netperf.org/netperf/DownloadNetperf.html

In the VM:

# ./netserver
# ./netperf -t TCP_RR

Be sure to use an SMP guest (we want TCP_RR to be a cross-core message
passing workload in order to test halt-polling).

Ah, OK, I use the same benchmark as you.

Regards,
Wanpeng Li


