Re: [PATCH] KVM: x86: Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_NMI)

From: Mingwei Zhang
Date: Wed Sep 27 2023 - 14:24:20 EST


On Wed, Sep 27, 2023 at 9:10 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Sep 26, 2023, Xin Li wrote:
> > On 9/26/2023 9:15 PM, Mingwei Zhang wrote:
> > > ah, typo in the subject: The 2nd KVM_REQ_NMI should be KVM_REQ_PMI.
> > > Sorry about that.
> > >
> > > On Tue, Sep 26, 2023 at 9:09 PM Mingwei Zhang <mizhang@xxxxxxxxxx> wrote:
> > > >
> > > > Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_NMI).
> >
> > Please remove it, no need to repeat the subject.
>
> Heh, from Documentation/process/maintainer-kvm-x86.rst:
>
> Changelog
> ~~~~~~~~~
> Most importantly, write changelogs using imperative mood and avoid pronouns.
>
> See :ref:`describe_changes` for more information, with one amendment: lead with
> a short blurb on the actual changes, and then follow up with the context and
> background. Note! This order directly conflicts with the tip tree's preferred
> approach! Please follow the tip tree's preferred style when sending patches
> that primarily target arch/x86 code that is _NOT_ KVM code.
>
> That said, I do prefer that the changelog intro isn't just a copy+paste of the
> shortlog, and the shortlog and changelog should use conversational language instead
> of describing the literal code movement.
>
> > > > When vPMU is in active use, processing each KVM_REQ_PMI will generate a
>
> This is not guaranteed.
>
> > > > KVM_REQ_NMI. The existing control flow after KVM_REQ_PMI is processed fails
> > > > the guest entry, jumps to kvm_x86_cancel_injection(), and re-enters
> > > > vcpu_enter_guest(), which wastes a lot of cycles and increases the overhead
> > > > of vPMU as well as virtualization in general.
>
> As above, use conversational language, the changelog isn't meant to be a play-by-play.
>
> E.g.
>
> KVM: x86: Service NMI requests *after* PMI requests in VM-Enter path
>
> Move the handling of NMI requests after PMI requests in the VM-Enter path
> so that KVM doesn't need to cancel and redo VM-Enter in the likely
> scenario that the vCPU has configured its LVPTC entry to generate an NMI.
>
> Because APIC emulation "injects" NMIs via KVM_REQ_NMI, handling PMI
> requests after NMI requests means KVM won't detect the pending NMI request
> until the final check for outstanding requests. Detecting requests at the
> final stage is costly as KVM has already loaded guest state, potentially
> queued events for injection, disabled IRQs, dropped SRCU, etc., most of
> which needs to be unwound.
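
A rough toy model of the ordering argument above, for illustration only. This is not KVM code: `enter_guest_attempts` and the set-based request tracking are hypothetical, and the model assumes that servicing a PMI always raises an NMI request, which the thread notes is likely but not guaranteed.

```python
# Toy model only: the request names mirror the thread, nothing here is KVM code.
REQ_NMI = "KVM_REQ_NMI"
REQ_PMI = "KVM_REQ_PMI"


def enter_guest_attempts(pending, order):
    """Count the VM-Enter attempts needed to drain the pending requests.

    `order` is the sequence in which request types are checked on each pass.
    Assumption: servicing a PMI always raises an NMI request.
    """
    pending = set(pending)
    attempts = 0
    while True:
        attempts += 1
        for req in order:
            if req in pending:
                pending.discard(req)
                if req == REQ_PMI:
                    # The PMI is delivered as an NMI, so a new request appears.
                    pending.add(REQ_NMI)
        if not pending:
            # The final check finds nothing outstanding: the guest is entered.
            return attempts
        # Otherwise the final check sees an outstanding request, and the
        # attempt is cancelled and redone.


nmi_first = enter_guest_attempts({REQ_PMI}, [REQ_NMI, REQ_PMI])
pmi_first = enter_guest_attempts({REQ_PMI}, [REQ_PMI, REQ_NMI])
print(nmi_first, pmi_first)
```

In this model, checking NMI before PMI costs one cancelled attempt per PMI, while checking PMI first lets the same pass pick up the resulting NMI request.
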
>
> > Optimization is after correctness, so please explain if this is correct
> > first!
>
> Not first. Leading with an in-depth description of KVM requests and NMI handling
> is not going to help understand *why* this change is being made. But I do agree
> that this should provide an analysis of why it's ok to swap the order, specifically
> why it's architecturally ok if KVM drops an NMI due to the swapped ordering, e.g.
> if the PMI is coincident with two other NMIs (or one other NMI and NMIs are blocked).
>
> > > > So move the code snippet of kvm_check_request(KVM_REQ_NMI) to make KVM
> > > > runloop more efficient with vPMU.
> > > >
> > > > To evaluate the effectiveness of this change, we launch an 8-vcpu QEMU VM on
>
> Avoid pronouns. There's no need for all the "fluff", just state the setup, the
> test, and the results.
>
> Really getting into the nits, but the whole "8-vcpu QEMU VM" versus
> "the setup of using single core, single thread" is confusing IMO. If there were
> potential performance downsides and/or tradeoffs, then getting the gory details
> might be necessary, but that's not the case here, and if it were really necessary
> to drill down that deep, then I would want to better quantify the impact, e.g. in
> terms of latency.
>
> E.g. on Intel SPR running SPEC2017 benchmark and Intel vtune in the guest,
> handling PMI requests before NMI requests reduces the number of canceled
> runs by ~1500 per second, per vCPU (counted by probing calls to
> vmx_cancel_injection()).
>
> > > > an Intel SPR CPU. In the VM, we run perf with all 48 events Intel vtune
> > > > uses. In addition, we use SPEC2017 benchmark programs as the workload with
> > > > the setup of using single core, single thread.
> > > >
> > > > At the host level, we probe the invocations to vmx_cancel_injection() with
> > > > the following command:
> > > >
> > > > $ perf probe -a vmx_cancel_injection
> > > > $ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds
> > > >
> > > > The following is the result that we collected at the beginning of the spec2017
> > > > benchmark run (so mostly for 500.perlbench_r in spec2017). Kindly forgive
> > > > the incompleteness.
> > > >
> > > > On kernel without the change:
> > > > 10.010018010 14254 probe:vmx_cancel_injection
> > > > 20.037646388 15207 probe:vmx_cancel_injection
> > > > 30.078739816 15261 probe:vmx_cancel_injection
> > > > 40.114033258 15085 probe:vmx_cancel_injection
> > > > 50.149297460 15112 probe:vmx_cancel_injection
> > > > 60.185103088 15104 probe:vmx_cancel_injection
> > > >
> > > > On kernel with the change:
> > > > 10.003595390 40 probe:vmx_cancel_injection
> > > > 20.017855682 31 probe:vmx_cancel_injection
> > > > 30.028355883 34 probe:vmx_cancel_injection
> > > > 40.038686298 31 probe:vmx_cancel_injection
> > > > 50.048795162 20 probe:vmx_cancel_injection
> > > > 60.069057747 19 probe:vmx_cancel_injection
> > > >
> > > > From the above, it is clear that we save 1500 invocations per vcpu per
> > > > second to vmx_cancel_injection() for workloads like perlbench.
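
The "1500 per second" figure can be re-derived from the samples quoted above. A minimal sketch, assuming each perf stat interval is the nominal 10 seconds (the timestamps drift slightly) and that the single-threaded workload keeps one vCPU busy, so the system-wide count is attributed to that vCPU; the variable names are illustrative.

```python
# Counts of probe:vmx_cancel_injection per interval, copied from the data above.
without_change = [14254, 15207, 15261, 15085, 15112, 15104]
with_change = [40, 31, 34, 31, 20, 19]

INTERVAL_SECONDS = 10  # nominal "-I 10000" interval; an approximation


def per_second(samples, interval=INTERVAL_SECONDS):
    """Average number of events per second across all sampled intervals."""
    return sum(samples) / (len(samples) * interval)


rate_before = per_second(without_change)
rate_after = per_second(with_change)
saved = rate_before - rate_after

print(f"before: {rate_before:.1f}/s, after: {rate_after:.1f}/s, saved: {saved:.1f}/s")
```
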
>
> Nit, this really should have:
>
> Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
>
> I personally don't care about the attribution, but (a) others often do care and
> (b) the added context is helpful. E.g. for bad/questionable suggestions/ideas,
> knowing that person X was also involved helps direct and/or curate questions/comments
> accordingly.

For sure! I will also pay more attention to that in the future.

Thanks.
-Mingwei