Re: [RFC] KVM: x86: Allow userspace exit on HLT and MWAIT, else yield on MWAIT

From: David Woodhouse
Date: Sat Sep 23 2023 - 03:26:36 EST


On Fri, 2023-09-22 at 14:00 +0200, Paolo Bonzini wrote:
> On Mon, Sep 18, 2023 at 11:30 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> >
> > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> >
> > The VMM may have work to do on behalf of the guest, and it's often
> > desirable to use the cycles when the vCPUs are idle.
> >
> > When the vCPU uses HLT this works out OK because the VMM can run its
> > tasks in a separate thread which gets scheduled when the in-kernel
> > emulation of HLT schedules away. It isn't perfect, because it doesn't
> > easily allow for handling both low-priority maintenance tasks when the
> > VMM wants to wait until the vCPU is idle, and also for higher priority
> > tasks where the VMM does want to preempt the vCPU. It can also lead to
> > noisy neighbour effects, when a host isn't necessarily sized to
> > expect any given VMM to suddenly be contending for many *more* pCPUs
> > than it has vCPUs.
> >
> > In addition, there are times when we need to expose MWAIT to a guest
> > for compatibility with a previous environment. And MWAIT is much harder
> > because it's very hard to emulate properly.
>
> I don't dislike giving userspace more flexibility in deciding when to
> exit on HLT and MWAIT (or even PAUSE), and kvm_run is a good place to
> do this. It's an extension of request_interrupt_window and
> immediate_exit. I'm not sure how it would interact with
> KVM_CAP_X86_DISABLE_EXITS.

Yeah, right now it doesn't interact at all. The use case is that you
*always* allow vmexits to KVM for the offending instructions, and then
it's just a question of what KVM does when that happens.

> Perhaps KVM_ENABLE_CAP(KVM_X86_DISABLE_EXITS) could be changed to do
> nothing except writing to a new kvm_run field? All the
> kvm->arch.*_in_guest fields would change into a
> kvm->arch.saved_request_userspace_exit, and every vmentry would do
> something like
>
>   if (kvm->arch.saved_request_userspace_exit != kvm_run->request_userspace_exit) {
>      /* tweak intercepts */
>   }
>
> To avoid races you need two flags though; there needs to be also a
> kernel->userspace communication of whether the vCPU is currently in
> HLT or MWAIT, using the "flags" field for example. If it was HLT only,
> moving the mp_state in kvm_run would seem like a good idea; but not if
> MWAIT or PAUSE are also included.
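
For illustration only, the compare-on-vmentry idea quoted above could be
modelled in plain userspace C roughly as follows. This is a sketch, not KVM
code: the REQ_EXIT_* bit names, struct run_model, struct arch_model and
sync_exit_requests() are all hypothetical stand-ins, and the "tweak
intercepts" step is reduced to a counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit names for illustration; not the real KVM uAPI. */
#define REQ_EXIT_HLT   (1u << 0)
#define REQ_EXIT_MWAIT (1u << 1)
#define REQ_EXIT_PAUSE (1u << 2)

/* Stand-in for the kvm_run page that userspace writes to. */
struct run_model {
    uint32_t request_userspace_exit;
};

/* Stand-in for the kernel-side copy kept in kvm->arch. */
struct arch_model {
    uint32_t saved_request_userspace_exit;
    unsigned int intercept_updates; /* how often intercepts were reprogrammed */
};

/*
 * Called before each simulated vmentry. Reads the userspace request once,
 * and only touches the (simulated) intercepts when it differs from the
 * saved copy. Returns true if the intercepts were updated.
 */
static bool sync_exit_requests(struct arch_model *arch,
                               const struct run_model *run)
{
    uint32_t wanted = run->request_userspace_exit;

    if (arch->saved_request_userspace_exit == wanted)
        return false;

    /* Placeholder for "tweak intercepts". */
    arch->saved_request_userspace_exit = wanted;
    arch->intercept_updates++;
    return true;
}
```

The point of keeping a saved copy is that the common case, where userspace
has not changed its request, costs one comparison per vmentry.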

Right. When work is added to an empty workqueue, the VMM will want to
hunt for a vCPU which is currently idle and then signal it to exit.
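
As a rough sketch of that "hunt for an idle vCPU" step on the VMM side,
assuming each vCPU exposes some idle indication to userspace: struct
vcpu_hint, its two fields and claim_idle_vcpu() are hypothetical names
invented for this example, and the actual kick (a signal to the vCPU thread,
say) is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vCPU state visible to the VMM's worker threads. */
struct vcpu_hint {
    atomic_bool idle;           /* stand-in for an "in HLT/MWAIT" flag */
    atomic_bool exit_requested; /* stand-in for a userspace exit request */
};

/*
 * Find the first vCPU that is marked idle and has not already been claimed,
 * and mark it as having an exit requested. Returns its index, or -1 if no
 * idle, unclaimed vCPU was found.
 */
static int claim_idle_vcpu(struct vcpu_hint *vcpus, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!atomic_load(&vcpus[i].idle))
            continue;
        /* The exchange ensures two producers cannot claim the same vCPU. */
        if (!atomic_exchange(&vcpus[i].exit_requested, true))
            return (int)i;
    }
    return -1;
}
```

The idle flag is only a hint: the vCPU can become busy again between the
load and the kick, which is the race the two-flag scheme is meant to close.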

As you say, for HLT it's simple enough to look at the mp_state, and we
can move that into kvm_run so it doesn't need an ioctl... although it
would also be nice to get an *event* on an eventfd when the vCPU
becomes runnable (as noted, we want that for VSM anyway). Or perhaps
even to be able to poll() on the vCPU fd.

But MWAIT (as currently not-really-emulated) and PAUSE are both just
transient states with nothing you can really *wait* for, which is why
they're such fun to deal with.

> To set a kvm_run flag during MWAIT, you could reenter MWAIT with the
> MWAIT-exiting bit cleared and the monitor trap flag bit (or just
> EFLAGS.TF) set. On the subsequent singlestep exit, clear the flag in
> kvm_run and set again the MWAIT-exiting bit. The MWAIT handler would
> also check kvm_run->request_userspace_exit before reentering.
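
The sequence described above can be written down as a small state machine.
This is a userspace model under stated assumptions, not VMX code: struct
vcpu_model, its fields, the two enums and handle_exit() are hypothetical, and
the booleans merely stand in for the MWAIT-exiting control, the monitor trap
flag and the kvm_run flag.

```c
#include <stdbool.h>

enum exit_reason { EXIT_MWAIT, EXIT_SINGLESTEP };
enum action { RESUME_GUEST, EXIT_TO_USERSPACE };

/* Hypothetical model of the relevant per-vCPU state. */
struct vcpu_model {
    bool mwait_exiting;          /* models the MWAIT-exiting control */
    bool trap_after_insn;        /* models monitor trap flag or EFLAGS.TF */
    bool in_mwait_flag;          /* models the kernel->userspace flag */
    bool request_userspace_exit; /* models kvm_run->request_userspace_exit */
};

static enum action handle_exit(struct vcpu_model *v, enum exit_reason why)
{
    switch (why) {
    case EXIT_MWAIT:
        /* Check the userspace request before reentering. */
        if (v->request_userspace_exit)
            return EXIT_TO_USERSPACE;
        /* Let the guest execute MWAIT natively, trapping right after it. */
        v->in_mwait_flag = true;
        v->mwait_exiting = false;
        v->trap_after_insn = true;
        return RESUME_GUEST;
    case EXIT_SINGLESTEP:
        /* MWAIT has completed: clear the flag and rearm the intercept. */
        v->in_mwait_flag = false;
        v->trap_after_insn = false;
        v->mwait_exiting = true;
        return RESUME_GUEST;
    }
    return RESUME_GUEST;
}
```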

Yeah, we've pondered that one. Perhaps coupled with setting the scheduling
priority as low as possible while it's actually on the MWAIT, and
putting it back again afterwards. Something along the lines of 'do not
schedule me unless you literally have *nothing* else to do on this
pCPU, for the next N µs'.

Not pretty, but *nothing* you do with MWAIT is going to be pretty.
Unless we can tolerate 4KiB granularity and actually get the read-only
and minor fault trick working.

Anyway, I knocked this up just for Fred to play with and see what
actually performs reasonably and what doesn't, because I never want to
post even random proof-of-concept kernel patches in private. So we'll
play with it and see what we get out of it.

