Re: [PATCH] KVM: x86: Add a vCPU stat for #AC exceptions

From: Sean Christopherson
Date: Wed Apr 26 2023 - 12:56:08 EST


On Wed, Apr 26, 2023, Anselm Busse wrote:
> This patch adds a KVM vCPU stat that reflects the number of #AC
> exceptions caused by a guest. This improves the identification and
> debugging of issues that are possibly caused by guests triggering
> split-locks and allows more insight compared to the current situation
> of having only a warning printed when an #AC exception is raised.

Irrespective of the inaccuracy Xiaoyao pointed out, I don't want to add a one-off
stat for _any_ exception. I agree with what Marc said[*] when we (Google / GCP)
tried to push our pile o' stats upstream:

: Because I'm pretty sure that whatever stat we expose, every cloud
: vendor will want their own variant, so we may just as well put the
: matter in their own hands.

That doesn't mean I don't want a massive pile of stats about all things KVM, quite
the opposite, but I don't think they belong in upstream where KVM has to maintain
them in perpetuity. E.g. at some point in the (distant) future, split-lock #AC will
be completely uninteresting because all software will have been updated/fixed.

FWIW, we looked at using eBPF for our out-of-tree stats and ultimately decided that
carrying patches to add our stats would be significantly easier to maintain than an
eBPF-based approach, e.g. rebasing this patch is trivial. But the challenges we
anticipated with switching to eBPF were largely specific to running at scale. eBPF
is a very viable approach for gathering information for debug, development,
individual users, etc.

One idea I had for easing the pain of out-of-tree stats was to clean up KVM x86's
tracepoints, e.g. to give eBPF programs more stable and useful hooks, but also to
allow CSPs like us to play macro games to "inject" stats at key points, e.g. add
infrastructure to #define overload tracepoints to make KVM trampoline through
out-of-tree stats code. But we haven't pursued that idea because (a) as above,
carrying patches for out-of-tree stats requires minimal effort and (b) it wouldn't
eliminate "invasive" code because we (GCP) would inevitably want stats in places where
a KVM tracepoint makes no sense.
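
To make the "#define overload" idea concrete, here is a rough userspace sketch, not
actual KVM code. Every name in it (trace_guest_exception(), oot_stats_exception(),
oot_exception_stats, handle_guest_exception()) is hypothetical and only stands in
for whatever tracepoint and stats code would really be involved. The one hard fact
is that #AC is exception vector 17 on x86.

```c
/* x86 exception vector for #AC (Alignment Check). */
#define AC_VECTOR	17
#define NR_VECTORS	32

/* Hypothetical stand-in for an upstream tracepoint; it only counts its calls. */
static unsigned long upstream_trace_hits;

static void trace_guest_exception(unsigned int vector)
{
	(void)vector;
	upstream_trace_hits++;
}

/* Hypothetical out-of-tree stats: one counter per exception vector. */
static unsigned long oot_exception_stats[NR_VECTORS];

static void oot_stats_exception(unsigned int vector)
{
	if (vector < NR_VECTORS)
		oot_exception_stats[vector]++;
}

/*
 * The "macro game": overload the tracepoint name so that every existing call
 * site trampolines through the out-of-tree stats code first.  Parenthesizing
 * the name, (trace_guest_exception)(...), suppresses macro expansion and
 * calls the real function.
 */
#define trace_guest_exception(vector)			\
do {							\
	unsigned int v_ = (vector);			\
							\
	oot_stats_exception(v_);			\
	(trace_guest_exception)(v_);			\
} while (0)

/* Hypothetical call site; it is unchanged apart from picking up the macro. */
static void handle_guest_exception(unsigned int vector)
{
	trace_guest_exception(vector);
}
```

With something like this, calling handle_guest_exception(AC_VECTOR) bumps
oot_exception_stats[AC_VECTOR] and still fires the original tracepoint, without the
call site itself being touched.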

So as much as I advocate for pushing code upstream, this is one of the few areas
where I think it's better to carry code out-of-tree.

[*] https://lore.kernel.org/all/875yusv3vm.wl-maz@xxxxxxxxxx