Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

From: Sean Christopherson
Date: Fri Dec 15 2023 - 17:08:58 EST


On Fri, Dec 15, 2023, Joel Fernandes wrote:
> On Fri, Dec 15, 2023 at 11:38 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Fri, Dec 15, 2023, Joel Fernandes wrote:
> > > Hi Sean,
> > > Nice to see your quick response to the RFC, thanks. I wanted to
> > > clarify some points below:
> > >
> > > On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > > > > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > Now when I think about it, the implementation seems to
> > > > > suggest that we are putting policies in kvm. Ideally, the goal is:
> > > > > - guest scheduler communicates the priority requirements of the workload
> > > > > - kvm applies the priority to the vcpu task.
> > > >
> > > > Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
> > > > is that userspace doesn't have the right knobs to adjust the priority of a task
> > > > quickly and efficiently, then wouldn't it be better to solve that problem in a
> > > > generic way?
> > >
> > > No, it is not only about tasks. We are boosting anything RT or above
> > > such as softirq, irq etc as well.
> >
> > I was talking about the host side of things. A vCPU is a task, full stop. KVM
> > *may* have some information that is useful to the scheduler, but KVM does not
> > *need* to initiate adjustments to a vCPU's priority.
>
> Sorry I thought you were referring to guest tasks. You are right, KVM
> does not *need* to change priority. But a vCPU is a container of tasks
> who's priority dynamically changes. Still, I see your point of view
> and that's also why we offer the capability to be selectively enabled
> or disabled per-guest by the VMM (Vineeth will make it default off and
> opt-in in the next series).
>
> > > Could you please see the other patches?
> >
> > I already have, see my comments about boosting vCPUs that have received
> > NMIs and IRQs not necessarily being desirable.
>
> Ah, I was not on CC for that email. Seeing it now. I think I don't
> fully buy that argument, hard IRQs are always high priority IMHO.

They most definitely are not, and there are undoubtedly tiers of priority, e.g.
tiers are part and parcel of the APIC architecture. I agree that *most* IRQs are
high-ish priority, but that is not the same that *all* IRQs are high priority.
It only takes one example to disprove the latter, and I can think of several off
the top of my head.

Nested virtualization is the easy one to demonstrate.

On AMD, which doesn't have an equivalent to the VMX preemption timer, KVM uses a
self-IPI to wrest control back from the guest immediately after VMRUN in some
situations (mostly to inject events into L2 while honoring the architectural
priority of events). If the guest is running a nested workload, the resulting
IRQ in L1 is not at all interesting or high priority, as the L2 workload hasn't
suddenly become high priority just because KVM wants to inject an event.

Anyways, I didn't mean to start a debate over the priority of handling IRQs and
NMIs, quite the opposite actually. The point I'm trying to make is that under
no circumstance do I want KVM to be making decisions about whether or not such
things are high priority. I have no objection to KVM making information available
to whatever entity is making the actual decisions, it's having policy in KVM that
I am staunchly opposed to.

> If an hrtimer expires on a CPU running a low priority workload, that
> hrtimer might itself wake up a high priority thread. If we don't boost
> the hrtimer interrupt handler, then that will delay the wakeup as
> well. It is always a chain of events and it has to be boosted from the
> first event. If a system does not wish to give an interrupt a high
> priority, then the typical way is to use threaded IRQs and lower the
> priority of the thread. That will give the interrupt handler lower
> priority and the guest is free to do that. We had many POCs before
> where we don't boost at all for interrupts and they all fall apart.
> This is the only POC that works without any issues as far as we know
> (we've been trying to do this for a long time :P).

In *your* environment. The fact that it took multiple months to get a stable,
functional set of patches for one use case is *exactly* why I am pushing back on
this. Change any number of things about the setup and odds are good that the
result would need different tuning. E.g. the ratio of vCPUs to pCPUs, the number
of VMs, the number of vCPUs per VM, whether or not the host kernel is preemptible,
whether or not the guest kernel is preemptible, the tick rate of the host and
guest kernels, the workload of the VM, the affinity of tasks within the VM, and
and so on and so forth.

It's a catch-22 of sorts. Anything that is generic enough to land upstream is
likely going to be too coarse grained to be universally applicable.

> Regarding perf, I similarly disagree. I think a PMU event is super
> important (example, some versions of the kernel watchdog that depend
> on PMU fail without it). But if some VM does not want this to be
> prioritized, they could just not opt-in for the feature IMO. I can see
> your point of view that not all VMs may want this behavior though.

Or a VM may want it conditionally, e.g. only for select tasks.

> > > At the moment, the only ABI is a shared memory structure and a custom
> > > MSR. This is no different from the existing steal time accounting
> > > where a shared structure is similarly shared between host and guest,
> > > we could perhaps augment that structure with other fields instead of
> > > adding a new one?
> >
> > I'm not concerned about the number of structures/fields, it's the amount/type of
> > information and the behavior of KVM that is problematic. E.g. boosting the priority
> > of a vCPU that has a pending NMI is dubious.
>
> I think NMIs have to be treated as high priority, the perf profiling
> interrupt for instance works well on x86 (unlike ARM) because it can
> interrupt any context (other than NMI and possibly the machine check
> ones). On ARM on the other hand, because the perf interrupt is a
> regular non-NMI interrupt, you cannot profile hardirq and IRQ-disable
> regions (this could have changed since pseudo-NMI features). So making
> the NMI a higher priority than IRQ is not dubious AFAICS, it is a
> requirement in many cases IMHO.

Again, many, but not all. A large part of KVM's success is that KVM has very few
"opinions" of its own. Outside of the MMU and a few paravirt paths, KVM mostly
just emulates/virtualizes hardware according to the desires of userspace. This
has allowed a fairly large variety of use cases to spring up with relatively few
changes to KVM.

What I want to avoid is (a) adding something that only works for one use case
and (b) turning KVM into a scheduler of any kind.

> > Which illustrates one of the points I'm trying to make is kind of my point.
> > Upstream will never accept anything that's wildly complex or specific because such
> > a thing is unlikely to be maintainable.
>
> TBH, it is not that complex though.

Yet. Your use case is happy with relatively simple, coarse-grained hooks. Use
cases that want to squeeze out maximum performance, e.g. because shaving N% off
the runtime saves $$$, are likely willing to take on far more complexity, or may
just want to make decisions at a slightly different granularity.

> But let us know which parts, if any, can be further simplified (I saw your
> suggestions for next steps in the reply to Vineeth, those look good to me and
> we'll plan accordingly).

It's not a matter of simplifying things, it's a matter of KVM (a) not defining
policy of any kind and (b) KVM not defining a guest/host ABI.

> > > We have to intervene *before* the scheduler takes the vCPU thread off the
> > > CPU.
> >
> > If the host scheduler is directly involved in the paravirt shenanigans, then
> > there is no need to hook KVM's VM-Exit path because the scheduler already has the
> > information needed to make an informed decision.
>
> Just to clarify, we're not making any "decisions" in the VM exit path,

Yes, you are.

> we're just giving the scheduler enough information (via the
> setscheduler call). The scheduler may just as well "decide" it wants
> to still preempt the vCPU thread -- that's Ok in fact required some
> times. We're just arming it with more information, specifically that
> this is an important thread. We can find another way to pass this
> information along (BPF etc) but I just wanted to mention that KVM is
> not really replacing the functionality or decision-making of the
> scheduler even with this POC.

Yes, it is. kvm_vcpu_kick_boost() *directly* adjusts the priority of the task.
KVM is not just passing a message, KVM is defining a scheduling policy of "boost
vCPUs with pending IRQs, NMIs, SMIs, and PV unhalt events".

The VM-Exit path also makes those same decisions by boosting a vCPU if the guest
has requested boost *or* the vCPU has a pending event (but oddly, not pending
NMIs, SMIs, or PV unhalt events):

bool pending_event = kvm_cpu_has_pending_timer(vcpu) || kvm_cpu_has_interrupt(vcpu);

/*
* vcpu needs a boost if
* - A lazy boost request active or a pending latency sensitive event, and
* - Preemption disabled duration on this vcpu has not crossed the threshold.
*/
return ((schedinfo.boost_req == VCPU_REQ_BOOST || pending_event) &&
!kvm_vcpu_exceeds_preempt_disabled_duration(&vcpu->arch));


Which, by the by is suboptimal. Detecting for pending events isn't free, so you
ideally want to check for pending events if and only if the guest hasn't requested
a boost.

> > > Similarly, in the case of an interrupt injected into the guest, we have
> > > to boost the vCPU before the "vCPU run" stage -- anything later might be too
> > > late.
> >
> > Except that this RFC doesn't actually do this. KVM's relevant function names suck
> > and aren't helping you, but these patches boost vCPUs when events are *pended*,
> > not when they are actually injected.
>
> We are doing the injection bit in:
> https://lore.kernel.org/all/20231214024727.3503870-5-vineeth@xxxxxxxxxxxxxxx/
>
> For instance, in:
>
> kvm_set_msi ->
> kvm_irq_delivery_to_apic ->
> kvm_apic_set_irq ->
> __apic_accept_irq ->
> kvm_vcpu_kick_boost();
>
> The patch is a bit out of order because patch 4 depends on 3. Patch 3
> does what you're referring to, which is checking for pending events.
>
> Did we miss something? If there is some path that we are missing, your
> help is much appreciated as you're likely much more versed with this
> code than me. But doing the boosting at injection time is what has
> made all the difference (for instance with cyclictest latencies).

That accepts in IRQ into the vCPU's local APIC, it does not *inject* the IRQ into
the vCPU proper. The actual injection is done by kvm_check_and_inject_events().
A pending IRQ is _usually_ delivered/injected fairly quickly, but not always.

E.g. if the guest has IRQs disabled (RFLAGS.IF=0), KVM can't immediately inject
the IRQ (without violating x86 architecture). In that case, KVM will twiddle
VMCS/VMCB knobs to detect an IRQ window, i.e. to cause a VM-Exit when IRQs are
no longer blocked in the guest.

Your PoC actually (mostly) handles this (see above) by keeping the vCPU boosted
after EXIT_REASON_INTERRUPT_WINDOW (because the IRQ will obviously still be pending).

> > Boosting the priority of vCPUs at semi-arbitrary points is going to be much more
> > difficult for KVM to "get right". E.g. why boost the priority of a vCPU that has
> > a pending IRQ, but not a vCPU that is running with IRQs disabled?
>
> I was actually wanting to boost preempted vCPU threads that were
> preempted in IRQ disabled regions as well. In fact, that is on our
> TODO.. we just haven't done it yet as we notice that usually IRQ is
> disabled while preemption was already disabled and just boosting
> preempt-disabled gets us most of the way there.
>
> > The potential for endless twiddling to try and tune KVM's de facto
> > scheduling logic so that it's universally "correct" is what terrifies me.
>
> Yes, we can certainly look into BPF to make it a bit more generic for
> our usecase (while getting enough information from the kernel).
>
> By the way, one other usecase for this patch series is RCU. I am one
> of the RCU maintainers and I am looking into improving RCU in VMs. For
> instance, boosting preempted RCU readers to RT (CONFIG_RCU_BOOST) does
> not usually work because the vCPU thread on the host is not RT.
> Similarly, other RCU threads which have RT priority don't get to run
> causing issues.
>
> Thanks!
>
> - Joel