Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

From: Vineeth Remanan Pillai
Date: Fri Dec 15 2023 - 09:34:58 EST


> > >
> > I get your point. A generic way would have been more preferable, but I
> > feel the scenario we are tackling is a bit more time critical and kvm
> > is better equipped to handle this. kvm has control over the VM/vcpu
> > execution and hence it can take action in the most effective way.
>
> No, KVM most definitely does not. Between sched, KVM, and userspace, I would
> rank KVM a very distant third. Userspace controls when to do KVM_RUN, to which
> cgroup(s) a vCPU task is assigned, the affinity of the task, etc. sched decides
> when and where to run a vCPU task based on input from userspace.
>
> Only in some edge cases that are largely unique to overcommitted CPUs does KVM
> have any input on scheduling whatsoever. And even then, KVM's view is largely
> limited to a single VM, e.g. teaching KVM to yield to a vCPU running in a different
> VM would be interesting, to say the least.
>
The over-committed case is exactly what we are trying to tackle. Sorry
for not making this clear in the cover letter. ChromeOS runs on low-end
devices (e.g. 2C/2T CPUs) that do not have enough compute capacity to
offload scheduling decisions. In-band scheduling decisions gave the
best results.

> > One example is the place where we handle boost/unboost. By the time
> > you come out of kvm to userspace it would be too late.
>
> Making scheduling decisions in userspace doesn't require KVM to exit to userspace.
> It doesn't even need to require a VM-Exit to KVM. E.g. if the scheduler (whether
> it's in kernel or userspace) is running on a different logical CPU(s), then there's
> no need to trigger a VM-Exit because the scheduler can incorporate information
> about a vCPU in real time, and interrupt the vCPU if and only if something else
> needs to run on that associated CPU. From the sched_ext cover letter:
>
> : Google has also experimented with some promising, novel scheduling policies.
> : One example is “central” scheduling, wherein a single CPU makes all
> : scheduling decisions for the entire system. This allows most cores on the
> : system to be fully dedicated to running workloads, and can have significant
> : performance improvements for certain use cases. For example, central
> : scheduling with VCPUs can avoid expensive vmexits and cache flushes, by
> : instead delegating the responsibility of preemption checks from the tick to
> : a single CPU. See scx_central.bpf.c for a simple example of a central
> : scheduling policy built in sched_ext.
>
This makes sense when the host has enough compute resources for
offloading scheduling decisions. In an over-committed system, an
out-of-band scheduler would itself need CPU time to make decisions,
and starvation of that scheduler may make the situation worse. We
could probably tune the scheduler's priority to minimize its latency,
but in our experience this did not scale, due to the nature of the
CPU interruptions that happen on consumer devices.

> > Currently we apply the boost soon after VMEXIT before enabling preemption so
> > that the next scheduler entry will consider the boosted priority. As soon as
> > you enable preemption, the vcpu could be preempted and boosting would not
> > help when it is boosted. This timing correctness is very difficult to achieve
> > if we try to do it in userland or do it out-of-band.
>
> Hooking VM-Exit isn't necessarily the fastest and/or best time to make scheduling
> decisions about vCPUs. Presumably the whole point of this is to allow running
> high priority, latency sensitive workloads in the guest. As above, the ideal scenario
> is that a vCPU running a high priority workload would never exit in the first place.
>
> Is it easy to get there? No. But it's definitely possible.
>
Low-end devices do not have the luxury of dedicating physical CPUs to
vCPUs, and having an out-of-band scheduler also adds to the load of the
system. In this RFC, a boost request doesn't induce an immediate
VMEXIT; it just sets a shared memory flag and continues to run. On
the very next VMEXIT, kvm checks the shared memory and passes the
request to the scheduler. This technique avoids extra VMEXITs for
boosting, but still uses the fast in-band scheduling mechanism to
achieve the desired results.
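
To make the timing concrete, here is a minimal C sketch of that flow;
the pv_sched_area layout and the kvm_apply_boost() helper are
hypothetical names for this sketch, not the RFC's actual interface.
The point is that the guest only writes a flag, and kvm consumes it on
the exit path before preemption is re-enabled.

  /* Hypothetical shared area; not the RFC's actual layout. */
  struct pv_sched_area {
          u32 boost_requested;    /* guest sets, host clears */
  };

  /* Guest side: request a boost without forcing a VM-exit. */
  static void guest_request_boost(struct pv_sched_area *pv)
  {
          WRITE_ONCE(pv->boost_requested, 1);
          /* Keep running; the host notices on the next natural VM-exit. */
  }

  /* Host side: called on the VM-exit path, before preemption is
   * re-enabled, so the very next scheduler entry already sees the
   * boosted priority. */
  static void kvm_handle_boost_request(struct kvm_vcpu *vcpu,
                                       struct pv_sched_area *pv)
  {
          if (READ_ONCE(pv->boost_requested)) {
                  WRITE_ONCE(pv->boost_requested, 0);
                  kvm_apply_boost(vcpu);  /* hypothetical: hand off to the scheduler */
          }
  }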

> > [...snip...]
> > > > > Lastly, if the concern/argument is that userspace doesn't have the right knobs
> > > > > to (quickly) boost vCPU tasks, then the proposed sched_ext functionality seems
> > > > > tailor made for the problems you are trying to solve.
> > > > >
> > > > > https://lkml.kernel.org/r/20231111024835.2164816-1-tj%40kernel.org
> > > > >
> > > > You are right, sched_ext is a good choice to have policies
> > > > implemented. In our case, we would need a communication mechanism as
> > > > well and hence we thought kvm would work best to be a medium between
> > > > the guest and the host.
> > >
> > > Making KVM be the medium may be convenient and the quickest way to get a PoC
> > > out the door, but effectively making KVM a middle-man is going to be a huge net
> > > negative in the long term. Userspace can communicate with the guest just as
> > > easily as KVM, and if you make KVM the middle-man, then you effectively *must*
> > > define a relatively rigid guest/host ABI.
> > >
> > > If instead the contract is between host userspace and the guest, the ABI can be
> > > much more fluid, e.g. if you (or any setup) can control at least some amount of
> > > code that runs in the guest, then the contract between the guest and host doesn't
> > > even need to be formally defined, it could simply be a matter of bundling host
> > > and guest code appropriately.
> > >
> > > If you want to land support for a given contract in upstream repositories, e.g.
> > > to broadly enable paravirt scheduling support across a variety of userspace VMMs
> > > and/or guests, then yeah, you'll need a formal ABI. But that's still not a good
> > > reason to have KVM define the ABI. Doing it in KVM might be a wee bit easier because
> > > it's largely just a matter of writing code, and LKML provides a centralized channel
> > > for getting buyin from all parties. But defining an ABI that's independent of the
> > > kernel is absolutely doable, e.g. see the many virtio specs.
> > >
> > > I'm not saying KVM can't help, e.g. if there is information that is known only
> > > to KVM, but the vast majority of the contract doesn't need to be defined by KVM.
> > >
> > As you mentioned, custom contract between guest and host userspace is
> > really flexible, but I believe tackling scheduling(especially latency)
> > issues is a bit more difficult with generic approaches. Here kvm does
> > have some information known only to kvm(which could be shared - eg:
> > interrupt injection) but more importantly kvm has some unique
> > capabilities when it comes to scheduling. kvm and scheduler are
> > cooperating currently for various cases like, steal time accounting,
> > vcpu preemption state, spinlock handling etc. We could possibly try to
> > extend it a little further in a non-intrusive way.
>
> I'm not too worried about the code being intrusive, I'm worried about the
> maintainability, longevity, and applicability of this approach.
>
> IMO, this has a significantly lower ceiling than what is possible with something
> like sched_ext, e.g. it requires a host tick to make scheduling decisions, and
> because it'd require a kernel-defined ABI, would essentially be limited to knobs
> that are broadly useful. I.e. every bit of information that you want to add to
> the guest/host ABI will need to get approval from at least the affected subsystems
> in the guest, from KVM, and possibly from the host scheduler too. That's going
> to make for a very high bar.
>
Just thinking out loud, the ABI could be very simple to start with: a
shared page with dedicated guest and host areas. The guest fills in
details about its priority requirements, and the host fills in details
about the actions it took (boost/unboost, priority/sched class etc.).
Passing this information could be in-band or out-of-band; out-of-band
could be used by dedicated userland schedulers. If both guest and host
agree on in-band during guest startup, kvm could hand the data over to
the scheduler using a scheduler callback. I feel this small addition
to kvm would be maintainable, and by leaving the protocol for
interpreting the shared memory to the guest and host, it would be very
generic and cater to multiple use cases. Something like the above
could be used both by low-end devices and by high-end server-like
systems, with guest and host having custom protocols to interpret the
data and make decisions.
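
For illustration only, a strawman layout for such a shared page might
look like the following; all field names are made up for this sketch
and are not part of any proposed ABI:

  /* Strawman layout; field names are illustrative. */
  struct pv_sched_shared_page {
          /* Guest-written area: the guest's scheduling requirements. */
          struct {
                  u32 seq;                /* bumped on every guest update */
                  u32 boost_req;          /* e.g. 0 = none, 1 = normal, 2 = RT */
                  u32 preempt_disabled;   /* guest is in a critical section */
          } guest;

          /* Host-written area: the actions the host actually took. */
          struct {
                  u32 seq;
                  u32 action;             /* e.g. boosted/unboosted/ignored */
                  u32 policy;             /* sched class granted (SCHED_FIFO, ...) */
                  u32 priority;           /* priority granted */
          } host;
  };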

In this RFC, we have a miniature form of the above, where we have a
shared memory area and the scheduler callback is basically
sched_setscheduler. But it could be made very generic as part of the
ABI design. For out-of-band schedulers, this callback could be set up
by sched_ext, a userland scheduler or any similar out-of-band
scheduler.
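
A rough sketch of what such a pluggable callback could look like,
assuming the strawman shared page above; the hook type and the
registration point are hypothetical, only the default action
(sched_setscheduler on the vcpu task) is what this RFC effectively
does today:

  /* Hypothetical hook type; the default in-band action in this RFC is
   * effectively sched_setscheduler() on the vcpu task. */
  typedef void (*pv_sched_notify_t)(struct task_struct *vcpu_task,
                                    struct pv_sched_shared_page *sp);

  static pv_sched_notify_t pv_sched_notify;   /* hypothetical registration point */

  /* Default in-band handler: apply the guest's request directly. */
  static void pv_sched_default_notify(struct task_struct *vcpu_task,
                                      struct pv_sched_shared_page *sp)
  {
          struct sched_param param = {
                  .sched_priority = sp->guest.boost_req ? 1 : 0,
          };
          int policy = sp->guest.boost_req ? SCHED_FIFO : SCHED_NORMAL;

          sched_setscheduler(vcpu_task, policy, &param);

          /* Report back what was done so the guest can adapt. */
          sp->host.policy = policy;
          sp->host.action = sp->guest.boost_req;
  }

  /* An out-of-band scheduler (sched_ext, a userland scheduler) would
   * instead register its own pv_sched_notify and consume the shared
   * page asynchronously. */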

I agree, getting a consensus and approval is non-trivial. IMHO, this
use case is compelling for such an ABI because out-of-band schedulers
might not give the desired results for low-end devices.

> > Having a formal paravirt scheduling ABI is something we would want to
> > pursue (as I mentioned in the cover letter) and this could help not
> > only with latencies, but optimal task placement for efficiency, power
> > utilization etc. kvm's role could be to set the stage and share
> > information with minimum delay and less resource overhead.
>
> Making KVM middle-man is most definitely not going to provide minimum delay or
> overhead. Minimum delay would be the guest directly communicating with the host
> scheduler. I get that convincing the sched folks to add a bunch of paravirt
> stuff is a tall order (for very good reason), but that's exactly why I Cc'd the
> sched_ext folks.
>
As mentioned above, the guest directly talking to the host scheduler
without involving kvm would mean an out-of-band scheduler, and its
effectiveness depends on how fast that scheduler gets to run. On
low-end compute devices, that poses a challenge. In such scenarios,
kvm seems to be a better option to provide minimum delay and CPU
overhead.

Sorry for not being clear in the cover letter: the goal is to have a
minimal-latency, minimal-overhead framework that works for low-end
devices as well, where CPU capacity is constrained. We are striving
for a design that focuses on the constraints of systems without
compute capacity to spare, while still catering to generic use cases.
This would also be useful for cloud providers whose offerings are
mostly over-committed VMs, and we have seen interest from that crowd.

Thanks,
Vineeth