Re: [PATCH v2 4/6] KVM: x86/pmu: Add pmc->intr to refactor kvm_perf_overflow{_intr}()

From: Like Xu
Date: Thu Dec 16 2021 - 04:57:52 EST


On 13/12/2021 2:37 pm, Jim Mattson wrote:
On Sat, Dec 11, 2021 at 8:56 PM Jim Mattson <jmattson@xxxxxxxxxx> wrote:

On Fri, Dec 10, 2021 at 3:31 PM Jim Mattson <jmattson@xxxxxxxxxx> wrote:

On Fri, Dec 10, 2021 at 2:59 PM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:

On 12/10/21 23:55, Jim Mattson wrote:

Even for tracing the SDM says "Like the value returned by RDTSC, TSC
packets will include these adjustments, but other timing packets (such
as MTC, CYC, and CBR) are not impacted". Considering that "stand-alone
TSC packets are typically generated only when generation of other timing
packets (MTCs and CYCs) has ceased for a period of time", I'm not even
sure it's a good thing that the values in TSC packets are scaled and offset.

Back to the PMU, for non-architectural counters it's not really possible
to know if they count in cycles or not. So it may not be a good idea to
special case the architectural counters.

In that case, what we're doing with the guest PMU is not
virtualization. I don't know what it is, but it's not virtualization.

It's a use of profiling guest on the host side, like "perf kvm" and in that case,
we need to convert the guest's TSC values with the host view, taking into
account the guest TSC scaling.


It is virtualization even if it is incompatible with live migration to a
different SKU (where, as you point out below, multiple TSC frequencies
might also count as multiple SKUs). But yeah, it's virtualization with
more caveats than usual.

It's not virtualization if the counters don't count at the rate the
guest expects them to count.

We do have "Use TSC scaling" bit in the "Secondary Processor-Based VM-Execution Controls".


Per the SDM, unhalted reference cycles count at "a fixed frequency."
If the frequency changes on migration, then the value of this event is
questionable at best. For unhalted core cycles, on the other hand, the
SDM says, "The performance counter for this event counts across
performance state transitions using different core clock frequencies."
That does seem to permit frequency changes on migration, but I suspect
that software expects the event to count at a fixed frequency if
INVARIANT_TSC is set.

Yes, I may propose that pmu be used in conjunction with INVARIANT_TSC.


Actually, I now realize that unhalted reference cycles is independent
of the host or guest TSC, so it is not affected by TSC scaling.

I doubt it.

However, we still have to decide on a specific fixed frequency to
virtualize so that the frequency doesn't change on migration. As a
practical matter, it may be the case that the reference cycles
frequency is the same on all processors in a migration pool, and we
don't have to do anything.

Yes, someone is already doing this in a production environment.



I'm not sure that I buy your argument regarding consistency. In
general, I would expect the hypervisor to exclude non-architected
events from the allow-list for any VM instances running in a
heterogeneous migration pool. Certainly, those events could be allowed
in a heterogeneous migration pool consisting of multiple SKUs of the
same microarchitecture running at different clock frequencies, but
that seems like a niche case.

IMO, if there are users who want to use the guest PMU, they definitely
want non-architectural events, even without live migration support.

Another input is that we actually have no problem reporting erratic
performance data during live migration transactions or host power
transactions, and there are situations where users want to know
that these kind of things are happening underwater.

The software performance tuners would not trust the perf data from
a single trial, relying more on statistical conclusions.



Exposing non-architectural events is questionable with live migration,
and TSC scaling is unnecessary without live migration. I suppose you
could have a migration pool with different SKUs of the same generation
with 'seemingly compatible' PMU events but different TSC frequencies,
in which case it might be reasonable to expose non-architectural
events, but I would argue that any of those 'seemingly compatible'
events are actually not compatible if they count in cycles.
I agree. Support for marshaling/unmarshaling PMU state exists but it's
more useful for intra-host updates than for actual live migration, since
these days most live migration will use TSC scaling on the destination.

Paolo


Unless, of course, Like is right, and the PMU counters do count fractionally.