Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

From: Sean Christopherson
Date: Thu Dec 14 2023 - 19:48:02 EST


On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> On Thu, Dec 14, 2023 at 3:13 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Thu, Dec 14, 2023, Vineeth Remanan Pillai wrote:
> > > On Thu, Dec 14, 2023 at 11:38 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > Now when I think about it, the implementation seems to
> > > suggest that we are putting policies in kvm. Ideally, the goal is:
> > > - guest scheduler communicates the priority requirements of the workload
> > > - kvm applies the priority to the vcpu task.
> >
> > Why? Tasks are tasks, why does KVM need to get involved? E.g. if the problem
> > is that userspace doesn't have the right knobs to adjust the priority of a task
> > quickly and efficiently, then wouldn't it be better to solve that problem in a
> > generic way?
> >
> I get your point. A generic way would have been preferable, but I
> feel the scenario we are tackling is a bit more time critical and kvm
> is better equipped to handle this. kvm has control over the VM/vcpu
> execution and hence it can take action in the most effective way.

No, KVM most definitely does not. Between sched, KVM, and userspace, I would
rank KVM a very distant third. Userspace controls when to do KVM_RUN, to which
cgroup(s) a vCPU task is assigned, the affinity of the task, etc. sched decides
when and where to run a vCPU task based on input from userspace.
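
For illustration, a minimal sketch of the generic userspace knobs mentioned
above, applied to a single thread. It assumes the VMM already knows the
thread ID of the vCPU thread; the helper names are hypothetical and not
taken from any VMM. Note that the nice value only matters for threads under
the default (SCHED_OTHER) policy.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the kernel thread ID of the calling thread. */
static pid_t current_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/*
 * Hypothetical helper: restrict one thread to a single CPU.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int pin_thread_to_cpu(pid_t tid, int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(tid, sizeof(set), &set);
}

/*
 * Hypothetical helper: change the nice value of one thread.
 * On Linux, setpriority(PRIO_PROCESS, tid, ...) with a thread ID
 * affects only that thread. Lowering the nice value (raising
 * priority) needs CAP_SYS_NICE or a suitable RLIMIT_NICE.
 */
static int set_thread_nice(pid_t tid, int nice_val)
{
    return setpriority(PRIO_PROCESS, (id_t)tid, nice_val);
}
```

A VMM would call these with the thread ID of each vCPU thread, e.g.
pin_thread_to_cpu(vcpu_tid, 3), where vcpu_tid is whatever ID the VMM
recorded when it created the thread. Cgroup placement is done separately,
by writing the thread ID into the relevant cgroup file.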

Only in some edge cases that are largely unique to overcommitted CPUs does KVM
have any input on scheduling whatsoever. And even then, KVM's view is largely
limited to a single VM, e.g. teaching KVM to yield to a vCPU running in a different
VM would be interesting, to say the least.

> One example is the place where we handle boost/unboost. By the time
> you come out of kvm to userspace it would be too late.

Making scheduling decisions in userspace doesn't require KVM to exit to userspace.
It doesn't even need to require a VM-Exit to KVM. E.g. if the scheduler (whether
it's in kernel or userspace) is running on a different logical CPU(s), then there's
no need to trigger a VM-Exit because the scheduler can incorporate information
about a vCPU in real time, and interrupt the vCPU if and only if something else
needs to run on that associated CPU. From the sched_ext cover letter:

: Google has also experimented with some promising, novel scheduling policies.
: One example is “central” scheduling, wherein a single CPU makes all
: scheduling decisions for the entire system. This allows most cores on the
: system to be fully dedicated to running workloads, and can have significant
: performance improvements for certain use cases. For example, central
: scheduling with VCPUs can avoid expensive vmexits and cache flushes, by
: instead delegating the responsibility of preemption checks from the tick to
: a single CPU. See scx_central.bpf.c for a simple example of a central
: scheduling policy built in sched_ext.

> Currently we apply the boost soon after VMEXIT, before enabling preemption, so
> that the next scheduler entry will consider the boosted priority. As soon as
> you enable preemption, the vcpu could be preempted, and a boost applied after
> that point would come too late to help. This timing correctness is very
> difficult to achieve if we try to do it in userland or do it out-of-band.

Hooking VM-Exit isn't necessarily the fastest and/or best time to make scheduling
decisions about vCPUs. Presumably the whole point of this is to allow running
high priority, latency sensitive workloads in the guest. As above, the ideal scenario
is that a vCPU running a high priority workload would never exit in the first place.

Is it easy to get there? No. But it's definitely possible.

> [...snip...]
> > > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > > tailor made for the problems you are trying to solve.
> > > >
> > > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > > >
> > > You are right, sched_ext is a good choice to have policies
> > > implemented. In our case, we would need a communication mechanism as
> > > well and hence we thought kvm would work best to be a medium between
> > > the guest and the host.
> >
> > Making KVM be the medium may be convenient and the quickest way to get a PoC
> > out the door, but effectively making KVM a middle-man is going to be a huge net
> > negative in the long term. Userspace can communicate with the guest just as
> > easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> > define a relatively rigid guest/host ABI.
> >
> > If instead the contract is between host userspace and the guest, the ABI can be
> > much more fluid, e.g. if you (or any setup) can control at least some amount of
> > code that runs in the guest, then the contract between the guest and host doesn't
> > even need to be formally defined, it could simply be a matter of bundling host
> > and guest code appropriately.
> >
> > If you want to land support for a given contract in upstream repositories, e.g.
> > to broadly enable paravirt scheduling support across a variety of userspace VMMs
> > and/or guests, then yeah, you'll need a formal ABI. But that's still not a good
> > reason to have KVM define the ABI. Doing it in KVM might be a wee bit easier because
> > it's largely just a matter of writing code, and LKML provides a centralized channel
> > for getting buy-in from all parties. But defining an ABI that's independent of the
> > kernel is absolutely doable, e.g. see the many virtio specs.
> >
> > I'm not saying KVM can't help, e.g. if there is information that is known only
> > to KVM, but the vast majority of the contract doesn't need to be defined by KVM.
> >
> As you mentioned, a custom contract between guest and host userspace is
> really flexible, but I believe tackling scheduling (especially latency)
> issues is a bit more difficult with generic approaches. Here kvm does
> have some information known only to kvm (which could be shared, e.g.
> interrupt injection), but more importantly kvm has some unique
> capabilities when it comes to scheduling. kvm and the scheduler are
> currently cooperating in various cases like steal time accounting,
> vcpu preemption state, spinlock handling, etc. We could possibly try to
> extend it a little further in a non-intrusive way.

I'm not too worried about the code being intrusive, I'm worried about the
maintainability, longevity, and applicability of this approach.

IMO, this has a significantly lower ceiling than what is possible with something
like sched_ext, e.g. it requires a host tick to make scheduling decisions, and
because it'd require a kernel-defined ABI, would essentially be limited to knobs
that are broadly useful. I.e. every bit of information that you want to add to
the guest/host ABI will need to get approval from at least the affected subsystems
in the guest, from KVM, and possibly from the host scheduler too. That's going
to make for a very high bar.

> Having a formal paravirt scheduling ABI is something we would want to
> pursue (as I mentioned in the cover letter) and this could help not
> only with latencies, but optimal task placement for efficiency, power
> utilization etc. kvm's role could be to set the stage and share
> information with minimum delay and less resource overhead.

Making KVM the middle-man is most definitely not going to provide minimum delay or
overhead. Minimum delay would be the guest directly communicating with the host
scheduler. I get that convincing the sched folks to add a bunch of paravirt
stuff is a tall order (for very good reason), but that's exactly why I Cc'd the
sched_ext folks.

> We could use schedulers (vanilla, sched_ext, ...) to actually make decisions
> based on the information they receive.