Re: [RFC PATCH 0/3] generic hypercall support

From: Gregory Haskins
Date: Fri May 08 2009 - 07:10:07 EST


Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> I think comparison is not entirely fair. You're using
>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
>> (on Intel) to only one register read:
>>
>> nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>>
>> Whereas in a real hypercall for (say) PIO you would need the address,
>> size, direction and data.
>>
>
> Well, that's probably one of the reasons pio is slower, as the cpu has
> to set these up, and the kernel has to read them.

Right, that was the point I was trying to make. Its real-world overhead
to measure how long it takes KVM to go round-trip in each of the
respective trap types.

>
>> Also for PIO/MMIO you're adding this unoptimized lookup to the
>> measurement:
>>
>> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>> if (pio_dev) {
>> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>> complete_pio(vcpu); return 1;
>> }
>>
>
> Since there are only one or two elements in the list, I don't see how
> it could be optimized.

To Marcelo's point, I think he was more taking exception to the fact
that the HC path was potentially completely optimized out if GCC was
super-intuitive about the switch(nr) statement hitting the null vector.
In theory, both the io_bus and the select(nr) are about equivalent in
algorithmic complexity (and depth, I should say) which is why I think in
general the test is "fair". IOW it represents the real-world decode
cycle function for each transport.

However, if one side was artificially optimized simply due to the
triviality of my NULLIO test, that is not fair, and that is the point I
believe he was making. In any case, I just wrote a new version of the
test which hopefully addresses forces GCC to leave it as a more
real-world decode. (FYI: I saw no difference). I will update the
tarball/wiki shortly.

>
>> Whereas for hypercall measurement you don't. I believe a fair comparison
>> would be have a shared guest/host memory area where you store guest/host
>> TSC values and then do, on guest:
>>
>> rdtscll(&shared_area->guest_tsc);
>> pio/mmio/hypercall
>> ... back to host
>> rdtscll(&shared_area->host_tsc);
>>
>> And then calculate the difference (minus guests TSC_OFFSET of course)?
>>
>
> I don't understand why you want host tsc? We're interested in
> round-trip latency, so you want guest tsc all the time.

Yeah, I agree. My take is he was just trying to introduce a real
workload so GCC wouldn't do that potential "cheater decode" in the HC
path. After thinking about it, however, I realized we could do that
with a simple "state++" operation, so the new test does this in each of
the various test's "execute" cycle. The timing calculation remains
unchanged.

-Greg


Attachment: signature.asc
Description: OpenPGP digital signature