Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

From: David Vernet
Date: Wed Jan 24 2024 - 12:09:32 EST

Next message: Wang, Weilin: "RE: [RFC PATCH v3 09/18] perf stat: Add functions to create new group and assign events into groups for hardware-grouping method"
Previous message: Jakub Kicinski: "Re: [RFC PATCH net-next 9/9] ethtool: Add ability to flash transceiver modules' firmware"
Next in thread: Joel Fernandes: "Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Jan 23, 2024 at 09:15:10PM -0500, Joel Fernandes wrote:
> Hi David,
> I got again side tracked. I'll now prioritize this thread with quicker
> (hopefully daily) replies.

Hi Joel,

Thanks for your detailed reply.

[...]

> > Thanks for clarifying. I think we're on the same page. I didn't mean to
> > imply that KVM is actually in the scheduler making decisions about what
> > task to run next, but that wasn't really my concern. My concern is that
> > this patch set makes KVM responsible for all of the possible paravirt
> > policies by encoding it in KVM UAPI, and is ultimately responsible for
> > being aware of and communicating those policies between the guest to the
> > host scheduler.
> >
> > Say that we wanted to add some other paravirt related policy like "these
> > VCPUs may benefit from being co-located", or, "this VCPU just grabbed a
> > critical spinlock so please pin me for a moment". That would require
> > updating struct guest_schedinfo UAPI in kvm_para.h, adding getters and
> > setters to kvm_host.h to set those policies for the VCPU (though your
> > idea to use a BPF hook on VMEXIT may help with that onme), setting the
> > state from the guest, etc.
>
> These are valid points, and I agree!
>
> >
> > KVM isn't making scheduling decisions, but it's the arbiter of what data
> > is available to the scheduler to consume. As it relates to a VCPU, it
> > seems like this patch set makes KVM as much invested in the scheduling
> > decision that's eventually made as the actual scheduler. Also worth
> > considering is that it ties KVM UAPI to sched/core.c, which seems
> > undesirable from the perspective of both subsystems.
>
> Ok, Agreed.
>
> >
> >> In that sense, I agree with Sean that we are probably forcing a singular
> >> policy on when and how to boost which might not work for everybody (in theory
> >> anyway). And I am perfectly OK with the BPF route as I mentioned in the other
> >
> > FWIW, I don't think it's just that boosting may not work well in every
> > context (though I do think there's an argument to be made for that, as
> > Sean pointed out r.e. hard IRQs in nested context). The problem is also
> > that boosting is just one example of a paravirt policy that you may want
> > to enable, as I alluded to above, and that comes with complexity and
> > maintainership costs.
>
> Ok, agreed.
>
> >
> >> email. So perhaps we can use a tracepoint in the VMEXIT path to run a BPF
> >> program (?). And we still have to figure out how to run BPF programs in the
> >> interrupt injections patch (I am currently studying those paths more also
> >> thanks to Sean's email describing them).
> >
> > If I understand correctly, based on your question below, the idea is to
> > call sched_setscheduler() from a kfunc in this VMEXIT BPF tracepoint?
> > Please let me know if that's correct -- I'll respond to this below where
> > you ask the question.
>
> Yes that's correct.
>
> >
> > As an aside, even if we called a BPF tracepoint prog on the VMEXIT path,
> > AFAIU it would still require UAPI changes given that we'd still need to
> > make all the same changes in the guest, right?
>
> By UAPI, do you mean hypercall or do you mean shared memory? If hypercall, we
> actually don't need hypercall for boosting. We boost during VMEXIT. We only need
> to set some state from the guest, in shared memory to hint that it might be
> needed at some point in the future. If no preemption-induced VMEXIT happens,
> then no scheduler boosting happens (or needs to happen).

So I was referring to setting state in shared memory from the guest.
Specifically, the UAPI defined in [0] and set in [1] (also shown below).
There's no hypercall here, right? But we're still adding UAPI for the
shared data structure written by the guest and consumed by the host.
Please let me know if I'm missing something, which seems very possible.

[0]: https://lore.kernel.org/all/20231214024727.3503870-4-vineeth@xxxxxxxxxxxxxxx/
[1]: https://lore.kernel.org/all/20231214024727.3503870-8-vineeth@xxxxxxxxxxxxxxx/

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6b1dea07a563..e53c3f3a88d7 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -167,11 +167,30 @@ enum kvm_vcpu_boost_state {
VCPU_BOOST_BOOSTED
};

+/*
+ * Boost Request from guest to host for lazy boosting.
+ */
+enum kvm_vcpu_boost_request {
+ VCPU_REQ_NONE = 0,
+ VCPU_REQ_UNBOOST,
+ VCPU_REQ_BOOST,
+};
+
+
+union guest_schedinfo {
+ struct {
+ __u8 boost_req;
+ __u8 preempt_disabled;
+ };
+ __u64 pad;
+};
+
/*
* Structure passed in via MSR_KVM_PV_SCHED
*/
struct pv_sched_data {
__u64 boost_status;
+ union guest_schedinfo schedinfo;
};

> There might be a caveat to the unboosting path though needing a hypercall and I
> need to check with Vineeth on his latest code whether it needs a hypercall, but
> we could probably figure that out. In the latest design, one thing I know is
> that we just have to force a VMEXIT for both boosting and unboosting. Well for
> boosting, the VMEXIT just happens automatically due to vCPU preemption, but for
> unboosting it may not.

As mentioned above, I think we'd need to add UAPI for setting state from
the guest scheduler, even if we didn't use a hypercall to induce a
VMEXIT, right?

> In any case, can we not just force a VMEXIT from relevant path within the guest,
> again using a BPF program? I don't know what the BPF prog to do that would look
> like, but I was envisioning we would call a BPF prog from within a guest if
> needed at relevant point (example, return to guest userspace).

I agree it would be useful to have a kfunc that could be used to force a
VMEXIT if we e.g. need to trigger a resched or something. In general
that seems like a pretty reasonable building block for something like
this. I expect there are use cases where doing everything async would be
useful as well. We'll have to see what works well in experimentation.

> Does that make sense?

Yes it does, though I want to make sure I'm not misunderstanding what
you mean by shared memory vs. hypercall as it relates to UAPI.

> > I see why having a BPF
> > hook here would be useful to avoid some of the logic on the host that
> > implements the boosting, and to give more flexibility as to when to
> > apply that boosting, but in my mind it seems like the UAPI concerns are
> > the most relevant.
>
> Yes, lets address the UAPI. My plan is to start a new design document like a
> google doc, and I could share that with you so we can sketch this out. What do
> you think? And perhaps also we can discuss it at LSFMM.

Awesome, thank you for writing that up. I'm happy to take a look when
it's ready for more eyes.

[...]

> >>> The main building block that I was considering is a new kptr [0] and set
> >>> of kfuncs [1] that would allow a BPF program to have one or more R/W
> >>> shared memory regions with a guest.
> >>
> >> I really like your ideas around sharing memory across virt boundary using
> >> BPF. The one concern I would have is that, this does require loading a BPF
> >> program from the guest userspace, versus just having a guest kernel that
> >> 'does the right thing'.
> >
> > Yeah, that's a fair concern. The problem is that I'm not sure how we get
> > around that if we want to avoid tying up every scheduling paravirt
> > policy into a KVM UAPI. Putting everything into UAPI just really doesn't
> > seem scalable. I'd be curious to hear Sean's thoughts on this. Note that
> > I'm not just talking about scheduling paravirt here -- one could imagine
> > many possible examples, e.g. relating to I/O, MM, etc.
>
> We could do same thing in guest, call BPF prog at a certain point, if needed.
> But again, since we only need to bother with VMEXIT for scheduler boosting
> (which doesn't need hypercall), it should be Ok. For the unboosting part, we

Hopefully my explanation above r.e. shared memory makes sense. My worry
isn't just for hypercalls, it's that we also need to make UAPI changes
for the guest to set the shared memory state that's read by the host. If
we instead had the BPF program in the guest setting state via its shared
memory channel with the BPF prog on the host, that's a different story.

> could implement that also using either a BPF prog at appropriate guest hook, or
> just having a timer go off to take away boost if boosting has been done too
> long. We could award a boost for a bounded time as well, and force a VMEXIT to
> unboost if VMEXIT has not already happened yet.. there should be many ways to
> avoid an unboost hypercall..
>
> > It seems much more scalable to instead have KVM be responsible for
> > plumbing mappings between guest and host BPF programs (I haven't thought
> > through the design or interface for that at _all_, just thinking in
> > high-level terms here), and then those BPF programs could decide on
> > paravirt interfaces that don't have to be in UAPI.
>
> It sounds like by UAPI, you mean hypercalls right? The actual shared memory
> structure should not be a UAPI concern since that will defined by the BPF
> program and how it wants to interpret the fields..

See above -- I mean the shared data structure added in patch 3 / 8
("kvm: x86: vcpu boosting/unboosting framework").

> > Having KVM be
> > responsible for setting up opaque communication channels between the
> > guest and host feels likes a more natural building block than having it
> > actually be aware of the policies being implemented over those
> > communication channels.
>
> Agreed.
>
> >
> >> On the host, I would have no problem loading a BPF program as a one-time
> >> thing, but on the guest it may be more complex as we don't always control the
> >> guest userspace and their BPF loading mechanisms. Example, an Android guest
> >> needs to have its userspace modified to load BPF progs, etc. Not hard
> >> problems, but flexibility comes with more cost. Last I recall, Android does
> >> not use a lot of the BPF features that come with the libbpf library because
> >> they write their own userspace from scratch (due to licensing). OTOH, if this
> >> was an Android kernel-only change, that would simplify a lot.
> >
> > That is true, but the cost is double-sided. On the one hand, we have the
> > complexity and maintenance costs of plumbing all of this through KVM and
> > making it UAPI. On the other, we have the cost of needing to update a
> > user space framework to accommodate loading and managing BPF programs.
> > At this point BPF is fully supported on aarch64, so Android should have
> > everything it needs to use it. It sounds like it will require some
> > (maybe even a lot of) work to accommodate that, but that seems
> > preferable to compensating for gaps in a user space framework by adding
> > complexity to the kernel, no?
>
> Yes it should be preferable.
>
> >
> >> Still there is a lot of merit to sharing memory with BPF and let BPF decide
> >> the format of the shared memory, than baking it into the kernel... so thanks
> >> for bringing this up! Lets talk more about it... Oh, and there's my LSFMMBPF
> >> invitiation request ;-) ;-).
> >
> > Discussing this BPF feature at LSFMMBPF is a great idea -- I'll submit a
> > proposal for it and cc you. I looked and couldn't seem to find the
> > thread for your LSFMMBPF proposal. Would you mind please sending a link?
>
> I actually have not even submitted one for LSFMM but my management is supportive
> of my visit. Do you want to go ahead and submit one with all of us included in
> the proposal? And I am again sorry for the late reply and hopefully we did not
> miss any deadlines. Also on related note, there is interest in sched_ext for

I see that you submitted a proposal in [2] yesterday. Thanks for writing
it up, it looks great and I'll comment on that thread adding a +1 for
the discussion.

[2]: https://lore.kernel.org/all/653c2448-614e-48d6-af31-c5920d688f3e@xxxxxxxxxxxxxxxxx/

No worries at all about the reply latency. Thank you for being so open
to discussing different approaches, and for driving the discussion. I
think this could be a very powerful feature for the kernel so I'm
pretty excited to further flesh out the design and figure out what makes
the most sense here.

> more custom scheduling. We could discuss that as well while at LSFMM.

Sure thing. I haven't proposed a topic for that yet as I didn't have
anything new to discuss following last year's discussion, but I'm happy
to continue the discussion this year. I'll follow up with you in a
separate thread.

> >
> >>> This could enable a wide swath of
> >>> BPF paravirt use cases that are not limited to scheduling, but in terms
> >>> of sched_ext, the BPF scheduler could communicate with the guest
> >>> scheduler over this shared memory region in whatever manner was required
> >>> for that use case.
> >>>
> >>> [0]: https://lwn.net/Articles/900749/
> >>> [1]: https://lwn.net/Articles/856005/
> >>
> >> Thanks, I had no idea about these. I have a question -- would it be possible
> >> to call the sched_setscheduler() function in core.c via a kfunc? Then we can
> >> trigger the boost from a BPF program on the host side. We could start simple
> >> from there.
> >
> > There's nothing stopping us from adding a kfunc that calls
> > sched_setscheduler(). The questions are how other scheduler folks feel
> > about that, and whether that's the appropriate general solution to the
> > problem. It does seem odd to me to have a random KVM tracepoint be
> > entitled to set a generic task's scheduling policy and priority.
>
> Such oddities are application specific though. User space can already call
> setscheduler arbitrarily, so why not a BPF program?

Hmm, that's a good point. Perhaps it does make sense. The BPF program
would be no less privileged than a user space application with
sufficient privileges to change a remote task's prio.

> >> I agree on the rest below. I just wanted to emphasize though that this patch
> >> series does not care about what the scheduler does. It merely hints the
> >> scheduler via a priority setting that something is an important task. That's
> >> a far cry from how to actually schedule and the spirit here is to use
> >> whatever scheduler the user has to decide how to actually schedule.
> >
> > Yep, I don't disagree with you at all on that point. To me this really
> > comes down to a question of the costs and long-term design choices, as
> > we discussed above:
> >
> > 1. What is the long-term maintenance cost to KVM and the scheduler to
> > land paravirt scheduling in this form?
> >
> > Even assuming that we go with the BPF hook on VMEXIT approach, unless
> > I'm missing something, I think we'd still need to make UAPI changes to
> > kvm_para.h, and update the guest to set the relevant paravirt state in
> > the guest scheduler.
>
> As mentioned above, for boosting, there is no hypercall. The VMEXIT is induced
> by host preemption.

I expect I am indeed missing something then, as mentioned above. VMEXIT
aside, we still need some UAPI for the shared structure between the
guest and host where the guest indicates its need for boosting, no?

> > That's not a huge amount of code for just boosting
> > and deboosting, but it sets the precedent that any and all future
> > scheduling paravirt logic will need to go through UAPI, and that the
> > core scheduler will need to accommodate that paravirt when and where
> > appropriate.
> >
> > I may be being overly pessimistic, but I feel that could open up a
> > rather scary can of worms; both in terms of the potential long-term
> > complexity in the kernel itself, and in terms of the maintenance burden
> > that may eventually be imposed to properly support it. It also imposes a
> > very high bar on being able to add and experiment with new paravirt
> > policies.
>
> Hmm, yeah lets discuss this more. It does appear we need to do *something* than
> leaving the performance on the table.

Agreed. There is a lot of performance to be gained from this. Doing
nothing is not the answer. Or at least not the answer I'm hoping for.

> >
> > 2. What is the cost we're imposing on users if we force paravirt to be
> > done through BPF? Is this prohibitively high?
> >
> > There is certainly a nonzero cost. As you pointed out, right now Android
> > apparently doesn't use much BPF, and adding the requisite logic to use
> > and manage BPF programs is not insigificant.
> >
> > Is that cost prohibitively high? I would say no. BPF should be fully
> > supported on aarch64 at this point, so it's really a user space problem.
> > Managing the system is what user space does best, and many other
> > ecosystems have managed to integrate BPF to great effect. So while the
> > cost is cetainly nonzero, I think there's a reasonable argument to be
> > made that it's not prohibitively high.
>
> Yes, I think it is doable.
>
> Glad to be able to finally reply, and I shall prioritize this thread more on my
> side moving forward.

Thanks for your detailed reply, and happy belated birthday :-)

- David

Attachment: signature.asc
Description: PGP signature

Next message: Wang, Weilin: "RE: [RFC PATCH v3 09/18] perf stat: Add functions to create new group and assign events into groups for hardware-grouping method"
Previous message: Jakub Kicinski: "Re: [RFC PATCH net-next 9/9] ethtool: Add ability to flash transceiver modules' firmware"
Next in thread: Joel Fernandes: "Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]