Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

From: Sean Christopherson
Date: Fri Dec 15 2023 - 11:38:32 EST


On Fri, Dec 15, 2023, Joel Fernandes wrote:
> Hi Sean,
> Nice to see your quick response to the RFC, thanks. I wanted to
> clarify some points below:
>
> On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > Now that I think about it, the implementation seems to
> > > suggest that we are putting policies in kvm. Ideally, the goal is:
> > > - guest scheduler communicates the priority requirements of the workload
> > > - kvm applies the priority to the vcpu task.
> >
> > Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
> > is that userspace doesn't have the right knobs to adjust the priority of a task
> > quickly and efficiently, then wouldn't it be better to solve that problem in a
> > generic way?
>
> No, it is not only about tasks. We are boosting anything RT or above,
> such as softirqs, irqs, etc., as well.

I was talking about the host side of things. A vCPU is a task, full stop. KVM
*may* have some information that is useful to the scheduler, but KVM does not
*need* to initiate adjustments to a vCPU's priority.

> Could you please see the other patches?

I already have, see my comments about boosting vCPUs that have received NMIs and
IRQs not necessarily being desirable.

> Also, Vineeth please make this clear in the next revision.
>
> > > > Pushing the scheduling policies to host userspace would allow for far more control
> > > > and flexibility. E.g. a heavily paravirtualized environment where host userspace
> > > > knows *exactly* what workloads are being run could have wildly different policies
> > > > than an environment where the guest is a fairly vanilla Linux VM that has received
> > > > a small amount of enlightenment.
> > > >
> > > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > > tailor made for the problems you are trying to solve.
> > > >
> > > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > > >
> > > You are right, sched_ext is a good choice to have policies
> > > implemented. In our case, we would need a communication mechanism as
> > > well and hence we thought kvm would work best to be a medium between
> > > the guest and the host.
> >
> > Making KVM be the medium may be convenient and the quickest way to get a PoC
> > out the door, but effectively making KVM a middle-man is going to be a huge net
> > negative in the long term. Userspace can communicate with the guest just as
> > easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> > define a relatively rigid guest/host ABI.
>
> At the moment, the only ABI is a shared memory structure and a custom
> MSR. This is no different from the existing steal time accounting
> where a shared structure is similarly shared between host and guest,
> we could perhaps augment that structure with other fields instead of
> adding a new one?

I'm not concerned about the number of structures/fields, it's the amount/type of
information and the behavior of KVM that is problematic. E.g. boosting the priority
of a vCPU that has a pending NMI is dubious. This RFC has baked in a large
number of assumptions that (mostly) fit your specific use case, but do not
necessarily apply to all use cases. I'm not even remotely convinced that what's
proposed here is optimal for your use case either.

> On the ABI point, we have deliberately tried to keep it simple (for example,
> a few months ago we had hypercalls and we went to great lengths to eliminate
> those).

Which kind of illustrates the point I'm trying to make.
Upstream will never accept anything that's wildly complex or specific because such
a thing is unlikely to be maintainable. And so we'll end up with functionality
in KVM that is moderately helpful, but doesn't fully solve things and doesn't have
legs to go anywhere precisely because the specificity and/or complexity required
to eke out maximum performance will never be accepted.

> > If instead the contract is between host userspace and the guest, the ABI can be
> > much more fluid, e.g. if you (or any setup) can control at least some amount of
> > code that runs in the guest
>
> I see your point of view. One way to achieve this is to have a BPF
> program run to implement the boosting part, in the VMEXIT path. KVM
> then just calls a hook. Would that alleviate some of your concerns?

Yes, it absolutely would! I would *love* to build a rich set of BPF utilities
and whatnot for KVM[1]. I have zero objections to KVM making data available to
BPF programs, i.e. to host userspace, quite the opposite. What I am steadfastly
against is KVM making decisions that are not obviously the "right" decisions in all
situations. And I do not want to end up in a world where KVM has a big pile of
knobs to let userspace control those decisions points, i.e. I don't want to make
KVM a de facto paravirt scheduler.

I think there is far more potential in this direction. KVM already has hooks
for VM-Exit and VM-Entry, they likely just need to be enhanced to make them more
useful for BPF programs. And adding hooks in other paths shouldn't be too
contentious, e.g. in addition to what you've done in this RFC, adding a hook to
kvm_vcpu_on_spin() could be quite interesting as I would not be at all surprised
if userspace could make far better decisions when selecting the vCPU to yield to.

And there are other use cases for KVM making "interesting" data available to
userspace, e.g. there's (very early) work[2] to allow userspace to poll() on vCPUs,
which likely needs much of the same information that paravirt scheduling would
find useful, e.g. whether or not the vCPU has pending events.

[1] https://lore.kernel.org/all/ZRIf1OPjKV66Y17%2F@xxxxxxxxxx
[2] https://lore.kernel.org/all/ZR9gATE2NSOOhedQ@xxxxxxxxxx

> > then the contract between the guest and host doesn't
> > even need to be formally defined, it could simply be a matter of bundling host
> > and guest code appropriately.
> >
> > If you want to land support for a given contract in upstream repositories, e.g.
> > to broadly enable paravirt scheduling support across a variety of userspace VMMs
> > and/or guests, then yeah, you'll need a formal ABI. But that's still not a good
> > reason to have KVM define the ABI. Doing it in KVM might be a wee bit easier because
> > it's largely just a matter of writing code, and LKML provides a centralized channel
> > for getting buyin from all parties. But defining an ABI that's independent of the
> > kernel is absolutely doable, e.g. see the many virtio specs.
> >
> > I'm not saying KVM can't help, e.g. if there is information that is known only
> > to KVM, but the vast majority of the contract doesn't need to be defined by KVM.
>
> The key to making this patch work is the VMEXIT path, which is only
> available to KVM. If we do anything later, then it might be too
> late.

Strictly speaking, no, it's not. It's key if and only if *KVM* boosts the priority
of the task/vCPU (or if KVM provides a hook for a BPF program to do its thing).

> We have to intervene *before* the scheduler takes the vCPU thread off the
> CPU.

If the host scheduler is directly involved in the paravirt shenanigans, then
there is no need to hook KVM's VM-Exit path because the scheduler already has the
information needed to make an informed decision.

> Similarly, in the case of an interrupt injected into the guest, we have
> to boost the vCPU before the "vCPU run" stage -- anything later might be too
> late.

Except that this RFC doesn't actually do this. KVM's relevant function names suck
and aren't helping you, but these patches boost vCPUs when events are *pended*,
not when they are actually injected.

Now, maybe boosting vCPUs with pending events is actually desirable, but that is
precisely the type of policy making that I do not want to have in KVM. Take the
existing kvm_vcpu_on_spin() path for example. It's a relatively simple idea, and
has no major downsides since KVM has very high confidence that the current vCPU
is spinning, i.e. waiting on something and thus doing nothing useful.

But even that path has a non-trivial amount of subtle, delicate logic to improve
the probability that KVM yields to a vCPU that can actually make forward progress.

Boosting the priority of vCPUs at semi-arbitrary points is going to be much more
difficult for KVM to "get right". E.g. why boost the priority of a vCPU that has
a pending IRQ, but not a vCPU that is running with IRQs disabled? The potential
for endless twiddling to try and tune KVM's de facto scheduling logic so that it's
universally "correct" is what terrifies me.

> Also you mentioned something about the tick path in the other email,
> we have no control over the host tick preempting the vCPU thread. The
> guest *will VMEXIT* on the host tick. On ChromeOS, we run multiple VMs
> and overcommitting is very common, especially on devices with a small
> number of CPUs.
>
> Just to clarify, this isn't a "quick POC". We have been working on
> this for many months and it was hard to get working correctly and
> handle all corner cases. We are finally at a point where - it just
> works (TM) and is roughly half the code size of when we initially
> started.