Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width

From: Jim Mattson
Date: Fri Jun 30 2023 - 13:16:45 EST


On Fri, Jun 30, 2023 at 10:08 AM Mingwei Zhang <mizhang@xxxxxxxxxx> wrote:
>
> On Fri, Jun 30, 2023 at 8:45 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > On Fri, Jun 30, 2023 at 07:28:29AM -0700, Sean Christopherson wrote:
> > > > On Fri, Jun 30, 2023, Roman Kagan wrote:
> > > > > On Thu, Jun 29, 2023 at 05:11:06PM -0700, Sean Christopherson wrote:
> > > > > > @@ -74,6 +74,14 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > > > > > return counter & pmc_bitmask(pmc);
> > > > > > }
> > > > > >
> > > > > > +static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> > > > > > +{
> > > > > > + if (pmc->perf_event && !pmc->is_paused)
> > > > > > + perf_event_set_count(pmc->perf_event, val);
> > > > > > +
> > > > > > + pmc->counter = val;
> > > > >
> > > > > Doesn't this still have the original problem of storing wider value than
> > > > > allowed?
> > > >
> > > > Yes, this was just to fix the counter offset weirdness. My plan is to apply your
> > > > patch on top. Sorry for not making that clear.
> > >
> > > Ah, got it, thanks!
> > >
> > > Also I'm now chasing a problem that we occasionally see
> > >
> > > [3939579.462832] Uhhuh. NMI received for unknown reason 30 on CPU 43.
> > > [3939579.462836] Do you have a strange power saving mode enabled?
> > > [3939579.462836] Dazed and confused, but trying to continue
> > >
> > > in the guests when perf is used. These messages disappear when
> > > 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") is
> > > reverted. I haven't yet figured out where exactly the culprit is.
> >
> > Can you reverting de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow
> > via pmc->prev_counter")? I suspect the problem is the prev_counter mess.
>
> For sure it is prev_counter issue, I have done some instrumentation as follows:
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 48a0528080ab..946663a42326 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -322,8 +322,11 @@ static void reprogram_counter(struct kvm_pmc *pmc)
> if (!pmc_event_is_allowed(pmc))
> goto reprogram_complete;
>
> - if (pmc->counter < pmc->prev_counter)
> + if (pmc->counter < pmc->prev_counter) {
> + pr_info("pmc->counter: %llx\tpmc->prev_counter: %llx\n",
> + pmc->counter, pmc->prev_counter);
> __kvm_perf_overflow(pmc, false);
> + }
>
> if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
> printk_once("kvm pmu: pin control bit is ignored\n");
>
> I find some interesting changes on prev_counter:
>
> [ +7.295348] pmc->counter: 12 pmc->prev_counter: fffffffffb3d
> [ +0.622991] pmc->counter: 3 pmc->prev_counter: fffffffffb1a
> [ +6.943282] pmc->counter: 1 pmc->prev_counter: fffffffff746
> [ +4.483523] pmc->counter: 0 pmc->prev_counter: ffffffffffff
> [ +12.817772] pmc->counter: 0 pmc->prev_counter: ffffffffffff
> [ +21.721233] pmc->counter: 0 pmc->prev_counter: ffffffffffff
>
> The first 3 logs will generate this:
>
> [ +11.811925] Uhhuh. NMI received for unknown reason 20 on CPU 2.
> [ +0.000003] Dazed and confused, but trying to continue
>
> While the last 3 logs won't. This is quite reasonable as looking into
> de0f619564f4 ("KVM: x86/pmu: Defer counter emulated overflow via
> pmc->prev_counter"), counter and prev_counter should only have 1 diff
> in value.

prev_counter isn't actually sync'ed at this point, is it? This comes
back to that "setting a running counter" nonsense. We want to add 1 to
the current counter, but we don't actually know what the current
counter is.

My interpretation of the above is that, in the first three cases, PMU
hardware has already detected an overflow. In the last three cases,
software counting has detected an overflow.

If the last three occur while executing the guest's PMI handler (i.e.
NMIs are blocked), then this could corroborate my conjecture about
IA32_DEBUGCTL.Freeze_PerfMon_On_PMI.

> So, the reasonable suspect should be the stale prev_counter. There
> might be several potential reasons behind this. Jim's theory is the
> highly reasonable one as I did another experiment and found that KVM
> may leave pmu->global_status as '0' when injecting an NMI.