Re: [PATCH kvm/queue v2 2/3] perf: x86/core: Add interface to query perfmon_event_map[] directly

From: Jim Mattson
Date: Mon Feb 14 2022 - 17:56:03 EST


On Mon, Feb 14, 2022 at 1:55 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>
>
>
> On 2/12/2022 6:31 PM, Jim Mattson wrote:
> > On Fri, Feb 11, 2022 at 1:47 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2/11/2022 1:08 PM, Jim Mattson wrote:
> >>> On Fri, Feb 11, 2022 at 6:11 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2/10/2022 2:55 PM, David Dunn wrote:
> >>>>> Kan,
> >>>>>
> >>>>> On Thu, Feb 10, 2022 at 11:46 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>>> No, we don't, at least for Linux. Because the host own everything. It
> >>>>>> doesn't need the MSR to tell which one is in use. We track it in an SW way.
> >>>>>>
> >>>>>> For the new request from the guest to own a counter, I guess maybe it is
> >>>>>> worth implementing it. But yes, the existing/legacy guest never check
> >>>>>> the MSR.
> >>>>>
> >>>>> This is the expectation of all software that uses the PMU in every
> >>>>> guest. It isn't just the Linux perf system.
> >>>>>
> >>>>> The KVM vPMU model we have today results in the PMU utilizing software
> >>>>> simply not working properly in a guest. The only case that can
> >>>>> consistently "work" today is not giving the guest a PMU at all.
> >>>>>
> >>>>> And that's why you are hearing requests to gift the entire PMU to the
> >>>>> guest while it is running. All existing PMU software knows about the
> >>>>> various constraints on exactly how each MSR must be used to get sane
> >>>>> data. And by gifting the entire PMU it allows that software to work
> >>>>> properly. But that has to be controlled by policy at host level such
> >>>>> that the owner of the host knows that they are not going to have PMU
> >>>>> visibility into guests that have control of PMU.
> >>>>>
> >>>>
> >>>> I think here is how a guest event works today with KVM and perf subsystem.
> >>>> - Guest create an event A
> >>>> - The guest kernel assigns a guest counter M to event A, and config
> >>>> the related MSRs of the guest counter M.
> >>>> - KVM intercepts the MSR access and create a host event B. (The
> >>>> host event B is based on the settings of the guest counter M. As I said,
> >>>> at least for Linux, some SW config impacts the counter assignment. KVM
> >>>> never knows it. Event B can only be a similar event to A.)
> >>>> - Linux perf subsystem assigns a physical counter N to a host event
> >>>> B according to event B's constraint. (N may not be the same as M,
> >>>> because A and B may have different event constraints)
> >>>>
> >>>> As you can see, even the entire PMU is given to the guest, we still
> >>>> cannot guarantee that the physical counter M can be assigned to the
> >>>> guest event A.
> >>>
> >>> All we know about the guest is that it has programmed virtual counter
> >>> M. It seems obvious to me that we can satisfy that request by giving
> >>> it physical counter M. If, for whatever reason, we give it physical
> >>> counter N isntead, and M and N are not completely fungible, then we
> >>> have failed.
> >>>
> >>>> How to fix it? The only thing I can imagine is "passthrough". Let KVM
> >>>> directly assign the counter M to guest. So, to me, this policy sounds
> >>>> like let KVM replace the perf to control the whole PMU resources, and we
> >>>> will handover them to our guest then. Is it what we want?
> >>>
> >>> We want PMU virtualization to work. There are at least two ways of doing that:
> >>> 1) Cede the entire PMU to the guest while it's running.
> >>
> >> So the guest will take over the control of the entire PMUs while it's
> >> running. I know someone wants to do system-wide monitoring. This case
> >> will be failed.
> >
> > We have system-wide monitoring for fleet efficiency, but since there's
> > nothing we can do about the efficiency of the guest (and those cycles
> > are paid for by the customer, anyway), I don't think our efficiency
> > experts lose any important insights if guest cycles are a blind spot.
>
> Others, e.g., NMI watchdog, also occupy a performance counter. I think
> the NMI watchdog is enabled by default at least for the current Linux
> kernel. You have to disable all such cases in the host when the guest is
> running.

It doesn't actually make any sense to run the NMI watchdog while in
the guest, does it?

> >
> >> I'm not sure whether you can fully trust the guest. If malware runs in
> >> the guest, I don't know whether it will harm the entire system. I'm not
> >> a security expert, but it sounds dangerous.
> >> Hope the administrators know what they are doing when choosing this policy.
> >
> > Virtual machines are inherently dangerous. :-)
> >
> > Despite our concerns about PMU side-channels, Intel is constantly
> > reminding us that no such attacks are yet known. We would probably
> > restrict some events to guests that occupy an entire socket, just to
> > be safe.
> >
> > Note that on the flip side, TDX and SEV are all about catering to
> > guests that don't trust the host. Those customers probably don't want
> > the host to be able to use the PMU to snoop on guest activity.
> >
> >>> 2) Introduce a new "ultimate" priority level in the host perf
> >>> subsystem. Only KVM can request events at the ultimate priority, and
> >>> these requests supersede any other requests.
> >>
> >> The "ultimate" priority level doesn't help in the above case. The
> >> counter M may not bring the precise which guest requests. I remember you
> >> called it "broken".
> >
> > Ideally, ultimate priority also comes with the ability to request
> > specific counters.
> >
> >> KVM can fails the case, but KVM cannot notify the guest. The guest still
> >> see wrong result.
> >>
> >>>
> >>> Other solutions are welcome.
> >>
> >> I don't have a perfect solution to achieve all your requirements. Based
> >> on my understanding, the guest has to be compromised by either
> >> tolerating some errors or dropping some features (e.g., some special
> >> events). With that, we may consider the above "ultimate" priority level
> >> policy. The default policy would be the same as the current
> >> implementation, where the host perf treats all the users, including the
> >> guest, equally. The administrators can set the "ultimate" priority level
> >> policy, which may let the KVM/guest pin/own some regular counters via
> >> perf subsystem. That's just my personal opinion for your reference.
> >
> > I disagree. The guest does not have to be compromised. For a proof of
> > concept, see VMware ESXi. Probably Microsoft Hyper-V as well, though I
> > haven't checked.
>
> As far as I know, VMware ESXi has its own VMkernel, which can owns the
> entire HW PMUs. The KVM is part of the Linux kernel. The HW PMUs should
> be shared among components/users. I think the cases are different.

Architecturally, ESXi is not very different from Linux. The VMkernel
is a posix-compliant kernel, and VMware's "vmm" is comparable to kvm.

> Also, from what I searched on the VMware website, they still encounter
> the case that a guest VM may not get a performance monitoring counter.
> It looks like their solution is to let guest OS check the availability
> of the counter, which is similar to the solution I mentioned (Use
> GLOBAL_INUSE MSR).
>
> "If an ESXi host's BIOS uses a performance counter or if Fault Tolerance
> is enabled, some virtual performance counters might not be available for
> the virtual machine to use."

I'm perfectly happy to give up PMCs on Linux under those same conditions.

> https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vm_admin.doc/GUID-F920A3C7-3B42-4E78-8EA7-961E49AF479D.html
>
> "In general, if a physical CPU PMC is in use, the corresponding virtual
> CPU PMC is not functional and is unavailable for use by the guest. Guest
> OS software detects unavailable general purpose PMCs by checking for a
> non-zero event select MSR value when a virtual machine powers on."
>
> https://kb.vmware.com/s/article/2030221
>
Linux, at least, doesn't do that. Maybe Windows does.

> Thanks,
> Kan
>