Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width

From: Mingwei Zhang
Date: Wed Aug 23 2023 - 14:29:03 EST


On Tue, Aug 22, 2023 at 2:30 AM Like Xu <like.xu.linux@xxxxxxxxx> wrote:
>
> On 1/7/2023 5:26 am, Sean Christopherson wrote:
> > Ugh, yeah, de0f619564f4 created a bit of a mess. The underlying issue that it
> > was solving is that perf_event_read_value() and friends might sleep (yay mutex),
> > and so can't be called from KVM's fastpath (IRQs disabled).
> Updating PMU counters for emulated instructions causes trouble.
>
> >
> > However, detecting overflow requires calling perf_event_read_value() to gather
> > the accumulated count from the hardware event in order to add it to the emulated
> > count from software. E.g. if pmc->counter is X and the perf event counter is Y,
> > KVM needs to factor in Y because X+Y+1 might overflow even if X+1 does not.
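
To make the point concrete, here is a minimal sketch (the helper name is
hypothetical, this is not actual KVM code) of why the hardware count has
to be folded in before checking for overflow:

static bool pmc_emul_incr_would_overflow(struct kvm_pmc *pmc)
{
        u64 enabled, running, hw = 0;

        /* May sleep, hence unusable from the IRQs-off fastpath. */
        if (pmc->perf_event)
                hw = perf_event_read_value(pmc->perf_event, &enabled, &running);

        /*
         * pmc->counter holds X (already-processed counts) and the perf
         * event has accumulated Y on top; the emulated increment
         * overflows when X + Y + 1 exceeds the counter's width, even
         * if X + 1 does not.
         */
        return (pmc->counter + hw + 1) > pmc_bitmask(pmc);
}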
> >
> > Trying to snapshot the previous counter value is a bit of a mess. It could
> > probably be made to work, but it's hard to reason about what the snapshot
> > actually contains
> > and when it should be cleared, especially when factoring in the wrapping logic.
> >
> > Rather than snapshot the previous counter, I think it makes sense to:
> >
> > 1) Track the number of emulated counter events
>
> If events are counted separately, the challenge is to emulate counter
> overflows at the right time, since an overflow can be triggered from
> either side while the two counts are out of sync.
>
> > 2) Accumulate and reset the counts from perf_event and emulated_counter into
> > pmc->counter when pausing the PMC
> > 3) Pause and reprogram the PMC on writes (instead of the current approach of
> > blindly updating the sample period)
>
> Updating the sample period is the only interface KVM has to configure the
> behaviour of the hardware counter. I note that perf_event_set_count() will
> be proposed, and I'm pessimistic about that change.
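
For reference, the status quo Like describes: on a counter write, KVM
leaves the event running and only adjusts the sample period, roughly
(simplified from the current pmc_update_sample_period() path):

        pmc->counter += data - pmc_read_counter(pmc);
        if (pmc->perf_event && !pmc->is_paused)
                perf_event_period(pmc->perf_event,
                                  get_sample_period(pmc, pmc->counter));

Note that nothing in this path truncates pmc->counter, which is what the
patch in $SUBJECT addresses.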
>
> > 4) Pause the counter when stopping the perf_event to ensure pmc->counter is
> > fresh (instead of manually updating pmc->counter)
> >
> > IMO, that yields more intuitive logic, and makes it easier to reason about
> > correctness since the behavior is easily defined: pmc->counter holds the counts
> > that have been gathered and processed, perf_event and emulated_counter hold
> > outstanding counts on top. E.g. on a WRMSR to the counter, both the emulated
> > counter and the hardware counter are reset, because whatever counts existed
> > previously are irrelevant.
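
A rough sketch of that pause-and-accumulate flow (the helper name is
hypothetical; emulated_counter is the new field from item 1 above):

static void pmc_pause_and_accumulate(struct kvm_pmc *pmc)
{
        u64 counter = pmc->counter;

        /* perf_event_pause() stops the event and returns its count. */
        if (pmc->perf_event)
                counter += perf_event_pause(pmc->perf_event, true);

        counter += pmc->emulated_counter;
        pmc->emulated_counter = 0;

        pmc->counter = counter & pmc_bitmask(pmc);
}

On a WRMSR, the outstanding counts would instead be discarded and
pmc->counter overwritten with the (truncated) guest value before the
counter is reprogrammed.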
>
> If we take the hardware view, a counter, emulated or not, just increments
> and overflows at the threshold. The missing logic here is how the counter
> is truncated when a value wider than the allowed width is written, and how
> to handle the value pmc->prev_counter held before pmc->counter was
> truncated.
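
For completeness, the truncation itself, as in Roman's fix, is roughly:

        /* on a guest write to a counter MSR */
        pmc->counter += data - pmc_read_counter(pmc);
        pmc->counter &= pmc_bitmask(pmc);

i.e. the written value is masked down to the counter's architectural
width before it lands in pmc->counter.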
>
> >
> > Pausing the counter _might_ make WRMSR slower, but we need to get this all
> > functionally correct before worrying too much about performance.
>
> Performance, security and correctness should all be considered from the beginning.
>

+1 on the performance part.

I did several rounds of performance testing: pausing the counter is
fast, but restarting it is *super* slow. The extra overhead might just
make the vPMU useless, especially when the guest workload saturates the
CPU/memory resources of the VM (as SPEC2017 does).

> >
> > Diff below for what I'm thinking (needs to be split into multiple patches). It's
> > *very* lightly tested.
>
> It saddens me that no one has come up with an actual low-level counter test for
> this issue.
>
> >
> > I'm about to disappear for a week; I'll pick this back up when I return. In
> > the meantime, any testing and/or input would be much appreciated!
>
> How about accepting Roman's original fix and then exercising the rewriting genius?

+1

I think the best option would be to just apply the fix in the short term
and put the refactoring of the emulated counter in a follow-up series.