Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm

From: David Vernet
Date: Fri Jan 26 2024 - 16:19:42 EST

Next message: T.J. Mercier: "[PATCH v2] mm: memcg: Don't periodically flush stats when memcg is disabled"
Previous message: Mark Brown: "Re: [PATCH v2 23/28] spi: s3c64xx: retrieve the FIFO size from the device tree"
In reply to: Joel Fernandes: "Re: [RFC PATCH 0/8] Dynamic vcpu priority management in kvm"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, Jan 24, 2024 at 08:08:56PM -0500, Joel Fernandes wrote:
> Hi David,

Hi Joel,

>
> On Wed, Jan 24, 2024 at 12:06 PM David Vernet <void@xxxxxxxxxxxxx> wrote:
> >
> [...]
> > > There might be a caveat to the unboosting path though needing a hypercall and I
> > > need to check with Vineeth on his latest code whether it needs a hypercall, but
> > > we could probably figure that out. In the latest design, one thing I know is
> > > that we just have to force a VMEXIT for both boosting and unboosting. Well for
> > > boosting, the VMEXIT just happens automatically due to vCPU preemption, but for
> > > unboosting it may not.
> >
> > As mentioned above, I think we'd need to add UAPI for setting state from
> > the guest scheduler, even if we didn't use a hypercall to induce a
> > VMEXIT, right?
>
> I see what you mean now. I'll think more about it. The immediate
> thought is to load BPF programs to trigger at appropriate points in
> the guest. For instance, we already have tracepoints for preemption
> disabling. I added that upstream like 8 years ago or something. And
> sched_switch already knows when we switch to RT, which we could
> leverage in the guest. The BPF program would set some shared memory
> state in whatever format it desires, when it runs is what I'm
> envisioning.

That sounds like it would work perfectly. Tracepoints are really ideal,
both because BPF doesn't allow (almost?) any kfuncs to be called from
fentry/kprobe progs (whereas they do from tracepoints), and because
tracepoint program arguments are trusted so the BPF verifier knows that
it's safe to pass them on to kfuncs, etc. as referenced kptrs.
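
To make the idea a bit more concrete, here is a rough userspace C model
of the logic such guest-side progs might implement. It is not actual BPF
and not code from the patch series. The struct layout, the function
names and the "prio below 100 means RT" shortcut (mirroring the kernel's
MAX_RT_PRIO) are all assumptions made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared-memory layout; nothing here is UAPI. */
struct vcpu_hint {
	uint32_t next_is_rt;		/* incoming task is an RT task */
	uint32_t preempt_disabled;	/* guest has preemption disabled */
	uint32_t boost_requested;	/* what the host side would read */
};

/* The layout crosses the guest/host boundary, so it must have no padding. */
static_assert(sizeof(struct vcpu_hint) == 12, "unexpected padding");

#define MODEL_MAX_RT_PRIO 100	/* assumption: mirrors the kernel's MAX_RT_PRIO */

/* Recompute the single flag the host is expected to look at. */
static inline void hint_update(struct vcpu_hint *h)
{
	h->boost_requested = h->next_is_rt || h->preempt_disabled;
}

/* Models a prog attached to the sched_switch tracepoint. */
static inline void hint_on_sched_switch(struct vcpu_hint *h, int next_prio)
{
	h->next_is_rt = next_prio < MODEL_MAX_RT_PRIO;
	hint_update(h);
}

/* Models progs attached to the preempt disable/enable tracepoints. */
static inline void hint_on_preempt(struct vcpu_hint *h, bool disabled)
{
	h->preempt_disabled = disabled;
	hint_update(h);
}
```

In a real prog the struct would live in memory shared with the host
rather than in a local variable, and how that memory gets mapped is
exactly the part that still needs to be designed.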

> By the way, one crazy idea about loading BPF programs into a guest..
> Maybe KVM can pass along the BPF programs to be loaded to the guest?
> The VMM can do that. The nice thing there is only the host would be
> the only responsible for the BPF programs. I am not sure if that makes
> sense, so please let me know what you think. I guess the VMM should
> also be passing additional metadata, like which tracepoints to hook
> to, in the guest, etc.

This I'm not sure I can really share an intelligent opinion on. My first
thought was that the guest VM would load some BPF programs at boot using
something like systemd, and then those progs would somehow register with
the VMM -- maybe through a kfunc implemented by KVM that makes a
hypercall. Perhaps what you're suggesting would work as well.

> > > In any case, can we not just force a VMEXIT from relevant path within the guest,
> > > again using a BPF program? I don't know what the BPF prog to do that would look
> > > like, but I was envisioning we would call a BPF prog from within a guest if
> > > needed at relevant point (example, return to guest userspace).
> >
> > I agree it would be useful to have a kfunc that could be used to force a
> > VMEXIT if we e.g. need to trigger a resched or something. In general
> > that seems like a pretty reasonable building block for something like
> > this. I expect there are use cases where doing everything async would be
> > useful as well. We'll have to see what works well in experimentation.
>
> Sure.
>
> > > >> Still there is a lot of merit to sharing memory with BPF and let BPF decide
> > > >> the format of the shared memory, than baking it into the kernel... so thanks
> > > >> for bringing this up! Let's talk more about it... Oh, and there's my LSFMMBPF
> > > >> invitation request ;-) ;-).
> > > >
> > > > Discussing this BPF feature at LSFMMBPF is a great idea -- I'll submit a
> > > > proposal for it and cc you. I looked and couldn't seem to find the
> > > > thread for your LSFMMBPF proposal. Would you mind please sending a link?
> > >
> > > I actually have not even submitted one for LSFMM but my management is supportive
> > > of my visit. Do you want to go ahead and submit one with all of us included in
> > > the proposal? And I am again sorry for the late reply and hopefully we did not
> > > miss any deadlines. Also on related note, there is interest in sched_ext for
> >
> > I see that you submitted a proposal in [2] yesterday. Thanks for writing
> > it up, it looks great and I'll comment on that thread adding a +1 for
> > the discussion.
> >
> > [2]: https://lore.kernel.org/all/653c2448-614e-48d6-af31-c5920d688f3e@xxxxxxxxxxxxxxxxx/
> >
> > No worries at all about the reply latency. Thank you for being so open
> > to discussing different approaches, and for driving the discussion. I
> > think this could be a very powerful feature for the kernel so I'm
> > pretty excited to further flesh out the design and figure out what makes
> > the most sense here.
>
> Great!
>
> > > As mentioned above, for boosting, there is no hypercall. The VMEXIT is induced
> > > by host preemption.
> >
> > I expect I am indeed missing something then, as mentioned above. VMEXIT
> > aside, we still need some UAPI for the shared structure between the
> > guest and host where the guest indicates its need for boosting, no?
>
> Yes you are right, it is more clear now what you were referring to
> with UAPI. I think we need figure that issue out. But if we can make
> the VMM load BPF programs, then the host can completely decide how to
> structure the shared memory.

Yep -- if the communication channel is from guest BPF -> host BPF, I
think the UAPI concerns are completely addressed. Figuring out how to
actually load the guest BPF progs and set up the communication channels
is another matter.
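
As a sketch of why the UAPI concern goes away in that model, consider
the following userspace C illustration. It assumes the kernel only ever
sees an opaque byte region, while the message format is private to the
two BPF programs. Every name here (opaque_chan, boost_msg,
guest_publish, host_read) and the 64-byte size are hypothetical, and the
sketch ignores memory ordering between guest and host entirely:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHAN_BYTES 64	/* arbitrary size picked for the example */

/* What the kernel/KVM would see: a blob with no defined fields. */
struct opaque_chan {
	uint8_t bytes[CHAN_BYTES];
};

/* Private format known only to the guest and host programs. */
struct boost_msg {
	uint32_t seq;		/* bumped by the guest on every update */
	uint32_t want_boost;	/* nonzero: guest asks for a boost */
};

static_assert(sizeof(struct boost_msg) <= CHAN_BYTES,
	      "message must fit in the channel");

/* Guest side: publish a new request into the opaque region. */
static inline void guest_publish(struct opaque_chan *c, uint32_t want_boost)
{
	struct boost_msg m;

	memcpy(&m, c->bytes, sizeof(m));
	m.seq++;
	m.want_boost = want_boost;
	memcpy(c->bytes, &m, sizeof(m));
}

/* Host side: decode the region using the same private format. */
static inline struct boost_msg host_read(const struct opaque_chan *c)
{
	struct boost_msg m;

	memcpy(&m, c->bytes, sizeof(m));
	return m;
}
```

Changing struct boost_msg only requires updating the two programs
together; nothing the kernel exposes would have to change.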

Thanks,
David

