Re: [PATCH v2] KVM: x86: Use fast path for Xen timer delivery

From: David Woodhouse
Date: Fri Sep 29 2023 - 17:25:31 EST


On Fri, 2023-09-29 at 08:16 -0700, Sean Christopherson wrote:
> On Fri, Sep 29, 2023, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> >
> > Most of the time there's no need to kick the vCPU and deliver the timer
> > event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> > directly from the timer callback, and only fall back to the slow path
> > when it's necessary to do so.
>
> It'd be helpful for non-Xen folks to explain "when it's necessary".  IIUC, the
> only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.

That's an implementation detail. Like all of the fast path functions
that can be called from kvm_arch_set_irq_inatomic(), it has its own
criteria for why it might return -EWOULDBLOCK or not. Those are *its*
business. And in fact one of Paul's current patches is tweaking them
subtly, but that isn't relevant here. (But yes, you are broadly correct
in your understanding.)

> > This gives a significant improvement in timer latency testing (using
> > nanosleep() for various periods and then measuring the actual time
> > elapsed).
> >
> > However, there was a reason¹ the fast path was dropped when this support
>
> Heh, please use [1] or [*] like everyone else.  I can barely see that tiny little ¹.

Isn't that the *point*? The reference to the footnote isn't supposed to
detract from the flow of the main text. It's exactly how you'll see it
when typeset properly. I've always assumed the people using [1] or [*]
just haven't yet realised that it's the 21st century and we are no
longer limited to 7-bit ASCII. Or haven't worked out how to type
anything but ASCII.

> > was first added. The current code holds vcpu->mutex for all operations on
> > the kvm->arch.timer_expires field, and the fast path introduces potential
> > race conditions. So... ensure the hrtimer is *cancelled* before making
> > changes in kvm_xen_start_timer(), and also when reading the values out
> > for KVM_XEN_VCPU_ATTR_TYPE_TIMER.
> >
> > Add some sanity checks to ensure the truth of the claim that all the
> > other code paths are run with the vcpu loaded.  And use hrtimer_cancel()
> > directly from kvm_xen_destroy_vcpu() to avoid a false positive from the
> > check in kvm_xen_stop_timer().
>
> This should really be at least 2 patches, probably 3:
>
>   1. add the assertions and the destroy_vcpu() change
>   2. cancel the timer before starting a new one or reading the value from userspace
>   3. use the fastpath delivery from the timer callback

Hm, I think that's borderline. I pondered it and came to the opposite
conclusion. Cancelling the timer wasn't needed without the fastpath
delivery; it isn't a separate fix. You could consider it a preparatory
patch I suppose... but I didn't. It's not like adding the fast path in
itself is complex enough that the other parts need to be broken out.

> > ¹ https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@xxxxxxxxxx/
> >
> > Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > ---
> >
> >  • v2: Remember, and deal with, those races.
> >
> >  arch/x86/kvm/xen.c | 64 +++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 58 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > index fb1110b2385a..9d0d602a2466 100644
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> > @@ -117,6 +117,8 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
> >  
> >  void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
> >  {
> > +       WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> > +
> >         if (atomic_read(&vcpu->arch.xen.timer_pending) > 0) {
> >                 struct kvm_xen_evtchn e;
> >  
> > @@ -136,18 +138,41 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
> >  {
> >         struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
> >                                              arch.xen.timer);
> > +       struct kvm_xen_evtchn e;
> > +       int rc;
> > +
> >         if (atomic_read(&vcpu->arch.xen.timer_pending))
> >                 return HRTIMER_NORESTART;
> >  
> > -       atomic_inc(&vcpu->arch.xen.timer_pending);
> > -       kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> > -       kvm_vcpu_kick(vcpu);
> > +       e.vcpu_id = vcpu->vcpu_id;
> > +       e.vcpu_idx = vcpu->vcpu_idx;
> > +       e.port = vcpu->arch.xen.timer_virq;
> > +       e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
> > +
> > +       rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
> > +       if (rc == -EWOULDBLOCK) {
> > +               atomic_inc(&vcpu->arch.xen.timer_pending);
> > +               kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> > +               kvm_vcpu_kick(vcpu);
> > +       } else {
> > +               vcpu->arch.xen.timer_expires = 0;
>
> Eww.  IIUC, timer_expires isn't cleared so that the pending IRQ is captured by
> kvm_xen_vcpu_get_attr(), i.e. because xen.timer_pending itself isn't migrated.

-EPARSE. Er... yes?

The timer_expires field is non-zero when there's a pending timer. When
the timer interrupt has fired, the time is no longer pending and the
timer_expires field is set to zero.

The xen.timer_pending field is indeed not migrated. We *flush* it in
kvm_xen_vcpu_get_attr(). It's now only used for the *deferral* to the
slow path.

So... yes, timer_expires *wasn't* previously cleared, because the timer
IRQ hadn't been delivered yet, and yes that was to avoid races with
kvm_xen_vcpu_get_attr(). Partly because the timer_pending field was
internal and not migrated. As far as userspace is concerned, the timer
has either fired, or it has not. Implementation details of the
timer_pending field — and the per-vcpu evtchn_pending_sel — are not
exposed.

> > +       }
> >  
> >         return HRTIMER_NORESTART;
> >  }
> >  
> >  static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
> >  {
> > +       WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> > +
> > +       /*
> > +        * Avoid races with the old timer firing. Checking timer_expires
> > +        * to avoid calling hrtimer_cancel() will only have false positives
> > +        * so is fine.
> > +        */
> > +       if (vcpu->arch.xen.timer_expires)
> > +               hrtimer_cancel(&vcpu->arch.xen.timer);
> > +
> >         atomic_set(&vcpu->arch.xen.timer_pending, 0);
> >         vcpu->arch.xen.timer_expires = guest_abs;
> >  
> > @@ -163,6 +188,8 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
> >  
> >  static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
> >  {
> > +       WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> > +
> >         hrtimer_cancel(&vcpu->arch.xen.timer);
> >         vcpu->arch.xen.timer_expires = 0;
> >         atomic_set(&vcpu->arch.xen.timer_pending, 0);
> > @@ -1019,13 +1046,38 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
> >                 r = 0;
> >                 break;
> >  
> > -       case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
> > +       case KVM_XEN_VCPU_ATTR_TYPE_TIMER: {
> > +               bool pending = false;
> > +
> > +               /*
> > +                * Ensure a consistent snapshot of state is captures, with a
>
> s/captures/captured

Ack.

> > +                * timer either being pending, or fully delivered. Not still
>
> Kind of a nit, but IMO "fully delivered" isn't accurate, at least not without
> more clarification.  I would considered "fully delivered" to mean that the IRQ
> has caused the guest to start executing its IRQ handler.  Maybe "fully queued in
> the event channel"?

OK, I'll reword.


> > +                * lurking in the timer_pending flag for deferred delivery.
> > +                */
> > +               if (vcpu->arch.xen.timer_expires) {
> > +                       pending = hrtimer_cancel(&vcpu->arch.xen.timer);
> > +                       kvm_xen_inject_timer_irqs(vcpu);
>
> If the goal is to not have pending timers, then kvm_xen_inject_timer_irqs()
> should be called unconditionally after canceling the hrtimer, no?

It *is* called unconditionally after cancelling the hrtimer.

Or did you mean unconditionally whether we cancel the hrtimer or not?
The comment explains the logic for not needing that. If *either* the
timer is still active, *or* it's already fired and has taken the slow
path and set the timer_pending flag, then timer_expires won't have been
zeroed yet. So the race conditions inherent in doing this conditional
on vcpu->arch.xen.timer_expires are fine because there are only false
positives (which cause us to cancel a timer, and try to inject a
pending IRQ, which wasn't running and wasn't pending respectively).

Sounds like I need to expand that comment?

> > +               }
> > +
> >                 data->u.timer.port = vcpu->arch.xen.timer_virq;
> >                 data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
> >                 data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
> > +
> > +               /*
> > +                * The timer may be delivered immediately, while the returned
> > +                * state causes it to be set up and delivered again on the
>
> Similar to the "fully delivered" stuff above, maybe s/timer/hrtimer to make it
> a bit more clear that the host hrtimer can fire twice, but it won't ever result
> in two timer IRQs being delivered from the guest's perspective.

Ack.

> > +                * destination system after migration. That's fine, as the
> > +                * guest will not even have had a chance to run and process
> > +                * the interrupt by that point, so it won't even notice the
> > +                * duplicate IRQ.
>
> Rather than say "so it won't even notice the duplicate IRQ", maybe be more explicit
> and say "so the queued IRQ is guaranteed to be coalesced in the event channel
> and/or guest local APIC".  Because I read the whole "delivered IRQ" stuff as there
> really being two injected IRQs into the guest.
>
> FWIW, this is all really gross, but I agree that even if the queued IRQ makes it
> all the way to the guest's APIC, the IRQs will still be coalesced in the end.
>

As discussed before, we *could* have made fetching the timer attr also
*pause* the timer. It just seemed like extra complexity for no good
reason. The shared info page containing the event channel bitmap has to
be migrated after the vCPU state anyway.


>
> > +                */
> > +               if (pending)
> > +                       hrtimer_start_expires(&vcpu->arch.xen.timer,
> > +                                             HRTIMER_MODE_ABS_HARD);
> > +
> >                 r = 0;
> >                 break;
> > -
> > +       }
> >         case KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR:
> >                 data->u.vector = vcpu->arch.xen.upcall_vector;
> >                 r = 0;
> > @@ -2085,7 +2137,7 @@ void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu)
> >  void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu)
> >  {
> >         if (kvm_xen_timer_enabled(vcpu))
>
> IIUC, this can more precisely be:
>
>         if (vcpu->arch.xen.timer_expires)
>                 hrtimer_cancel(&vcpu->arch.xen.timer);
>
> at which point it might make sense to add a small helper
>
>   static void kvm_xen_cancel_timer(struct kvm_vcpu *vcpu)
>   {
>         if (vcpu->arch.xen.timer_expires)
>                 hrtimer_cancel(&vcpu->arch.xen.timer);
>   }
>
> to share code with
>
> > -               kvm_xen_stop_timer(vcpu);
> > +               hrtimer_cancel(&vcpu->arch.xen.timer);
> >  
> >         kvm_gpc_deactivate(&vcpu->arch.xen.runstate_cache);
> >         kvm_gpc_deactivate(&vcpu->arch.xen.runstate2_cache);

In fact if I make the helper return a bool, I think I can use it
*three* times. At a cost of having a third function
kvm_xen_cancel_timer() alongside the existing kvm_xen_start_timer() and
kvm_xen_stop_timer(), and those *three* now make for a slightly
confusing set. I'll take a look and see how it makes me feel.

Attachment: smime.p7s
Description: S/MIME cryptographic signature