Re: [RFC 0/3] KVM: x86: introduce pv feature lazy tscdeadline

From: Wang Jianchao
Date: Thu Jul 13 2023 - 07:25:10 EST




On 2023.07.13 18:27, Xiaoyao Li wrote:
> On 7/13/2023 2:57 PM, Zhi Wang wrote:
>> On Thu, 13 Jul 2023 10:50:36 +0800
>> Wang Jianchao <jianchwa@xxxxxxxxxxx> wrote:
>>
>>>
>>>
>>> On 2023.07.13 02:14, Zhi Wang wrote:
>>>> On Fri,  7 Jul 2023 14:17:58 +0800
>>>> Wang Jianchao <jianchwa@xxxxxxxxxxx> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> This patchset attempts to introduce a new pv feature, lazy tscdeadline.
>>>>> Every time the guest writes MSR_IA32_TSC_DEADLINE, a vm-exit occurs
>>>>> and the host side handles it. However, many of these vm-exits are
>>>>> unnecessary because the timer is often overwritten before it expires.
>>>>>
>>>>> v : write to msr of tsc deadline
>>>>> | : timer armed by tsc deadline
>>>>>
>>>>>           v v v v v        | | | | |
>>>>> --------------------------------------->  Time
>>>>>
>>>>> The timer armed by an msr write is overwritten before it expires, so the
>>>>> vm-exit caused by that write is wasted. The lazy tscdeadline works as follows:
>>>>>
>>>>>           v v v v v        |       |
>>>>> --------------------------------------->  Time
>>>>>                            '- arm -'
>>>>>
>>>>
>>>> Interesting patch.
>>>>
>>>> I am a little bit confused by the chart above. It seems the writes to the MSR,
>>>> which are said to cause VM exits, are not reduced in the chart of lazy
>>>> tscdeadline; only the number of arms is reduced. And the benefit of
>>>> lazy tscdeadline is said to come from "less vm exit". Maybe it is better
>>>> to improve the chart a little bit to help people grasp the idea
>>>> easily?
>>>
>>> Thanks so much for your comment and sorry for my poor chart.
>>>
>>
>> You don't have to say sorry here. :) Save it for later when you actually
>> break something.
>>
>>> Let me try to rework the chart.
>>>
>>> Before this patch, every time the guest starts or modifies an hrtimer, it needs to write the
>>> tsc deadline msr; a vm-exit occurs and the host arms a hv or sw timer for it.
>>>
>>>
>>> w: write msr
>>> x: vm-exit
>>> t: hv or sw timer
>>>
>>>
>>> Guest
>>>           w
>>> --------------------------------------->  Time
>>> Host     x              t
>>>  
>>> However, in workloads that set up timers frequently, the tscdeadline msr is usually overwritten
>>> many times before the timer expires, and every time we modify the tscdeadline, a vm-exit occurs.
>>>
>>>
>>> 1. write to msr with t0
>>>
>>> Guest
>>>           w0
>>> ---------------------------------------->  Time
>>> Host     x0             t0
>>>
>>> 2. write to msr with t1
>>> Guest
>>>               w1
>>> ------------------------------------------>  Time
>>> Host         x1          t0->t1
>>>
>>>
>>> 3. write to msr with t2
>>> Guest
>>>                  w2
>>> ------------------------------------------>  Time
>>> Host            x2          t1->t2
>>>
>>> 4. write to msr with t3
>>> Guest
>>>                      w3
>>> ------------------------------------------>  Time
>>> Host                x3           t2->t3
>>>
>>>
>>>
>>> What this patch wants to do is eliminate the vm-exits x1, x2 and x3, as follows.
>>>
>>>
>>> First, as with other pv features, two fields are shared between guest and host:
>>>   - armed, the tscdeadline value that currently has a timer on the host side, only updated by the __host__ side
>>>   - pending, the next tscdeadline value, only updated by the __guest__ side
>>>
>>>
>>> 1. write to msr with t0
>>>
>>>               armed   : t0
>>>               pending : t0
>>> Guest
>>>           w0
>>> ---------------------------------------->  Time
>>> Host     x0             t0
>>>
>>> A vm-exit occurs and the host arms a timer for t0.
>>>
>>> 2. write to msr with t1
>>>
>>>               armed   : t0
>>>               pending : t1
>>>
>>> Guest
>>>               w1
>>> ------------------------------------------>  Time
>>> Host                     t0
>>>
>>> The tsc deadline that has already been armed, namely t0, is earlier than t1, so there is
>>> no need to write the msr; the guest just updates pending.
>>>
>>>
>>> 3. write to msr with t2
>>>
>>>               armed   : t0
>>>               pending : t2
>>> Guest
>>>                  w2
>>> ------------------------------------------>  Time
>>> Host                      t0
>>> Similar to step 2: just update the pending field with t2, no vm-exit.
>>>
>>>
>>> 4.  write to msr with t3
>>>
>>>               armed   : t0
>>>               pending : t3
>>>
>>> Guest
>>>                      w3
>>> ------------------------------------------>  Time
>>> Host                       t0
>>> Similar to step 2: just update the pending field with t3, no vm-exit.
>>>
>>>
>>> 5.  t0 expires, arm t3
>>>
>>>               armed   : t3
>>>               pending : t3
>>>
>>>
>>> Guest
>>> ------------------------------------------>  Time
>>> Host                       t0  ------> t3
>>>
>>> When t0 fires, the host checks the pending field and re-arms a timer based on it.
>>>
>>>
>>> This is the core idea of this patch ;)
>>>
>>
>> That's much better. Please keep this in the cover letter in the next RFC.
>>
>> My concern about this approach is: it might slightly affect timing-sensitive
>> workloads in the guest, as the approach merges deadline interrupts. The
>> guest might see fewer deadline interrupts than before. It might be better
>> to include a comparison of the number of deadline interrupts in the cover
>> letter.
>
> I don't think the guest will get fewer deadline interrupts, since the deadline is always updated before the timer expires.
>
> However, the host will get more deadline interrupts, because the timer for t0 is not disarmed when a new deadline (t1, t2, t3) is programmed.
>

I forgot to suppress injection of the local timer interrupt for t0 in this version. This will be fixed in the V3 patchset.
But there is still a preemption timer vm-exit for t0 ...
The worst case is: the guest programs t0 and then t1; the vm-exit for t1's msr write is avoided, but t0's preemption timer vm-exit replaces it.
In the other cases, there should still be a net reduction in vm-exits.
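
The accounting can be checked with a rough user-space model. This is only a sketch under stated assumptions, not code from the patchset: the names lazy_tscdeadline, guest_set_deadline, host_timer_fired and replay_lazy are hypothetical, 0 is assumed to mean "no timer armed", a deadline earlier than the armed one is assumed to still need a real msr write, and all writes are assumed to land before the first armed timer expires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the two fields shared between guest and host. */
struct lazy_tscdeadline {
	uint64_t armed;   /* deadline backed by a host timer; host-written */
	uint64_t pending; /* latest deadline the guest wants; guest-written */
};

struct exit_counts {
	unsigned int wrmsr; /* vm-exits caused by msr writes */
	unsigned int timer; /* host timer expirations (e.g. preemption timer) */
};

/*
 * Guest side: record the new deadline and decide whether a real msr write
 * is needed. Returns 1 if the write (and so a vm-exit) happens, else 0.
 */
static int guest_set_deadline(struct lazy_tscdeadline *s, uint64_t tsc)
{
	s->pending = tsc;
	if (s->armed != 0 && s->armed <= tsc)
		return 0;	/* armed timer fires first: only update pending */
	s->armed = tsc;		/* models the host handling the trapped write */
	return 1;
}

/*
 * Host side: the armed timer fired. Returns 1 if the interrupt should be
 * injected into the guest, 0 if the deadline was stale and got re-armed.
 */
static int host_timer_fired(struct lazy_tscdeadline *s)
{
	uint64_t fired = s->armed;

	assert(fired != 0);	/* a timer cannot fire unless one was armed */
	if (s->pending > fired) {
		s->armed = s->pending;	/* re-arm for the newer deadline */
		return 0;
	}
	s->armed = 0;
	return 1;
}

/* Replay n deadline writes, then let the host timers run out. */
static struct exit_counts replay_lazy(const uint64_t *deadlines, size_t n)
{
	struct lazy_tscdeadline s = { 0, 0 };
	struct exit_counts c = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++)
		c.wrmsr += guest_set_deadline(&s, deadlines[i]);
	while (s.armed != 0) {
		host_timer_fired(&s);
		c.timer++;
	}
	return c;
}
```

With the deadlines {100, 200, 300, 400} this model counts 1 msr-write exit and 2 timer expirations, against 4 msr-write exits and 1 expiration without the feature. With only {100, 200} both approaches cost 3 in total, which is the worst case mentioned above.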

Thanks
Jianchao