Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

From: Vineeth Remanan Pillai
Date: Fri Dec 15 2023 - 12:40:52 EST


[...snip...]
> > > IMO, this has a significantly lower ceiling than what is possible with something
> > > like sched_ext, e.g. it requires a host tick to make scheduling decisions, and
> > > because it'd require a kernel-defined ABI, would essentially be limited to knobs
> > > that are broadly useful. I.e. every bit of information that you want to add to
> > > the guest/host ABI will need to get approval from at least the affected subsystems
> > > in the guest, from KVM, and possibly from the host scheduler too. That's going
> > > to make for a very high bar.
> > >
> > Just thinking out loud, The ABI could be very simple to start with. A
> > shared page with dedicated guest and host areas. Guest fills details
> > about its priority requirements, host fills details about the actions
> > it took(boost/unboost, priority/sched class etc). Passing this
> > information could be in-band or out-of-band. out-of-band could be used
> > by dedicated userland schedulers. If both guest and host agrees on
> > in-band during guest startup, kvm could hand over the data to
> > scheduler using a scheduler callback. I feel this small addition to
> > kvm could be maintainable and by leaving the protocol for interpreting
> > shared memory to guest and host, this would be very generic and cater
> > to multiple use cases. Something like above could be used both by
> > low-end devices and high-end server like systems and guest and host
> > could have custom protocols to interpret the data and make decisions.
> >
> > In this RFC, we have a miniature form of the above, where we have a
> > shared memory area and the scheduler callback is basically
> > sched_setscheduler. But it could be made very generic as part of ABI
> > design. For out-of-band schedulers, this call back could be setup by
> > sched_ext, a userland scheduler and any similar out-of-band scheduler.
> >
> > I agree, getting a consensus and approval is non-trivial. IMHO, this
> > use case is compelling for such an ABI because out-of-band schedulers
> > might not give the desired results for low-end devices.
> >
> > > > Having a formal paravirt scheduling ABI is something we would want to
> > > > pursue (as I mentioned in the cover letter) and this could help not
> > > > only with latencies, but optimal task placement for efficiency, power
> > > > utilization etc. kvm's role could be to set the stage and share
> > > > information with minimum delay and less resource overhead.
> > >
> > > Making KVM middle-man is most definitely not going to provide minimum delay or
> > > overhead. Minimum delay would be the guest directly communicating with the host
> > > scheduler. I get that convincing the sched folks to add a bunch of paravirt
> > > stuff is a tall order (for very good reason), but that's exactly why I Cc'd the
> > > sched_ext folks.
> > >
> > As mentioned above, guest directly talking to host scheduler without
> > involving kvm would mean an out-of-band scheduler and the
> > effectiveness depends on how fast the scheduler gets to run.
>
> No, the "host scheduler" could very well be a dedicated in-kernel paravirt
> scheduler. It could be a sched_ext BPF program that for all intents and purposes
> is in-band.
>
Yes, if the scheduler is on the same physical cpu and acts on events
like VMEXIT/VMENTRY etc, this would work perfectly. Having the VM talk
to a scheduler running on another cpu and making decisions might not
be quick enough when we do not have enough cpu capacity.

> You are basically proposing that KVM bounce-buffer data between guest and host.
> I'm saying there's no _technical_ reason to use a bounce-buffer, just do zero copy.
>
I was also meaning zero copy only. The help required from the kvm side is:
- Pass the address of the shared memory to bpf programs/scheduler once
the guest sets it up.
- Invoke scheduler registered callbacks on events like VMEXIT,
VEMENTRY, interrupt injection etc. Its the job of guest and host
paravirt scheduler to interpret the shared memory contents and take
actions.

I admit current RFC doesn't strictly implement hooks and callbacks -
it calls sched_setscheduler in place of all callbacks that I mentioned
above. I guess this was your strongest objection.

As you mentioned in the reply to Joel, if it is fine for kvm to allow
hooks into events (VMEXIT, VMENTRY, interrupt injection etc) then, it
makes it easier to develop the ABI I was mentioning and have the hooks
implemented by a paravirt scheduler. We shall re-design the
architecture based on this for v2.

Thanks,
Vineeth