Re: [patch 0/3] KVM CPU frequency change hypercalls

From: Marcelo Tosatti
Date: Wed Mar 01 2017 - 10:22:58 EST


On Wed, Mar 01, 2017 at 03:21:32PM +0100, Paolo Bonzini wrote:
>
>
> On 28/02/2017 03:45, Marcelo Tosatti wrote:
> > On Fri, Feb 24, 2017 at 04:34:52PM +0100, Paolo Bonzini wrote:
> >>
> >>
> >> On 24/02/2017 14:04, Marcelo Tosatti wrote:
> >>>>>>> What's the current use case, or foreseeable future use case, for
> >>>>>>> save/restore across preemption again? (which would validate the
> >>>>>>> broken-by-design claim).
> >>>>>> Stop a guest that is using cpufreq, start a guest that is not using it.
> >>>>>> The second guest's performance now depends on the state that the first
> >>>>>> guest left in cpufreq.
> >>>>> Nothing prevents the host from implementing switching with the
> >>>>> current hypercall interface: all you need is a scheduler
> >>>>> hook.
> >>>> Can it be done in vcpu_load/vcpu_put? But you still would have two
> >>>> components (KVM and sysfs) potentially fighting over the frequency, and
> >>>> that's still a bit ugly.
> >>>
> >>> Change the frequency at vcpu_load/vcpu_put? Yes: call into
> >>> cpufreq-userspace. But there is no notion of "per-task frequency" on the
> >>> Linux kernel (which was the starting point of this subthread).
> >>
> >> There isn't, but this patchset is providing a direct path from a task to
> >> cpufreq-userspace. This is as close as you can get to a per-task frequency.
> >
> > Cpufreq-userspace is supposed to be used by tasks in userspace.
> > That's why it's called "userspace".
>
> I think the intended usecase is to have a daemon handling a systemwide
> policy. Examples are the historical (and now obsolete) users such as
> cpufreqd, cpudyn, powernowd, or cpuspeed. The user alternatively can
> play the role of the daemon by writing to sysfs.
>
> I've never seen userspace tasks talking to cpufreq-userspace to set
> their own running frequency. If DPDK does it, that's nasty in my
> opinion

Please expand on what "nasty" means, in detail. I really don't understand
why it's nasty.

> and we should find an interface that works best for both DPDK
> and KVM. Which should be done on linux-pm like Rafael suggested.
>
> >>> But if you configure all CPUs in the system as cpufreq-userspace,
> >>> then some other (userspace program) has to decide the frequency
> >>> for the other CPUs.
> >>>
> >>> Which agent would do that and why? That's why I initially said "what's
> >>> the use case".
> >>
> >> You could just pin them at the highest non-TurboBoost frequency until a
> >> guest runs. That's assuming that they are idle and, because of
> >> isolcpus/nohz_full, they would be almost always in a deep C-state anyway.
> >
> > The original claim of the thread was: "this feature (frequency
> > hypercalls) works for pinned vcpu<->pcpu, pcpu dedicated exclusively
> > to vcpu case, lets try to extend this to other cases".
> >
> > Which is a valid and useful direction to go.
> >
> > However there is no user for multiple vcpus in the same pcpu now.
>
> You are still ignoring the case of one guest started after another, or
> of another program started on a CPU that formerly was used by KVM. They
> don't have to be multiple users at the same time.

Just have the cpufreq-userspace policy be instantiated while the
isolated vcpu owns the pcpu. Before/after that, the previous policy
is in place.
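
A minimal sketch of that save/restore, assuming the switch is driven from
userspace through the standard cpufreq sysfs files (scaling_governor and
scaling_setspeed). The helper name userspace_governor and its root
parameter are hypothetical, made up for illustration only; an in-kernel
implementation would go through the cpufreq core instead.

```python
import contextlib
from pathlib import Path

# Standard sysfs location of the per-CPU cpufreq directories.
SYSFS_CPU = Path("/sys/devices/system/cpu")


@contextlib.contextmanager
def userspace_governor(cpu, khz, root=SYSFS_CPU):
    """Hold `cpu` at `khz` under the userspace governor, then restore.

    Hypothetical helper: `root` exists only so the sketch can be
    pointed at a scratch directory instead of the real sysfs.
    """
    cpufreq = Path(root) / f"cpu{cpu}" / "cpufreq"
    governor = cpufreq / "scaling_governor"

    # Remember whatever policy was in place before the vcpu took the pcpu.
    previous = governor.read_text().strip()
    governor.write_text("userspace")
    try:
        # scaling_setspeed takes the target frequency in kHz.
        (cpufreq / "scaling_setspeed").write_text(str(khz))
        yield previous
    finally:
        # Put the previous governor back once the vcpu releases the pcpu.
        governor.write_text(previous)
```

The body of a "with userspace_governor(3, 2400000):" block would cover the
time the isolated vcpu owns pcpu 3; on exit the earlier governor is back.
Writing to the real sysfs files needs root.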

> > If there were multiple vcpus, all of them requesting a given
> > frequency, it would be necessary to:
> >
> > 1) Keep the pcpu at the highest of the requested
> > frequencies.
> >
> > OR
> >
> > 2) Since switching frequencies can take up to 70us (*)
> > (depends on processor), it's generally not worthwhile
> > to switch frequencies between task switches.
>
> Is latency that important, or is overhead rather the thing to pay
> attention to? The slides you linked
> (http://www.ena-hpc.org/2013/pdf/04.pdf) on page 17 suggest it's around
> 10us.

Ok, be it 10us. 10us overhead on every task context switch is not
acceptable.
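
To put a rough number on it, here is a back-of-the-envelope sketch. The
10us cost is the figure quoted above; the context switch rates are
illustrative assumptions, not measurements, and the function name is made
up for this example.

```python
def switch_overhead_fraction(switch_cost_us, switches_per_second):
    """Fraction of one CPU's time spent changing frequency.

    Assumes every context switch pays the full frequency switch cost.
    """
    return switch_cost_us * 1e-6 * switches_per_second


# Assumed context switch rates, chosen only to show the scaling.
for rate in (100, 1_000, 10_000):
    share = switch_overhead_fraction(10, rate)
    print(f"{rate:>6} switches/s -> {share:.1%} of the CPU")
```

The cost grows linearly with the switch rate, and this counts only the
time lost, not the latency added to each individual switch.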

> One possibility is to do (1) if you have multiple tasks on the run queue
> (or fallback to what is specified in sysfs) and (2) if you only have one
> task.

Sure, that is alright. But the use-case at hand does not involve
multiple tasks on the pcpu.
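
For reference, one possible reading of the combined policy suggested
above, as a sketch. It assumes that "(1)" means the highest of the
requested frequencies, that a sysfs setting overrides it when present, and
that a lone task simply gets what it asked for. pick_frequency and its
parameters are hypothetical names, not part of the patchset.

```python
def pick_frequency(requests_khz, max_khz, sysfs_khz=None):
    """Choose a pcpu frequency from the requests of its runnable tasks.

    requests_khz: frequencies requested by the tasks on the run queue.
    max_khz:      highest frequency the pcpu supports.
    sysfs_khz:    optional administrator setting used as the fallback.
    """
    if len(requests_khz) == 1:
        # Single task on the pcpu: honor its request directly.
        return min(requests_khz[0], max_khz)
    if sysfs_khz is not None:
        # Several (or no) tasks: fall back to what sysfs specifies.
        return sysfs_khz
    if requests_khz:
        # Several tasks, no sysfs setting: use the highest request.
        return min(max(requests_khz), max_khz)
    # Nothing runnable and nothing configured: stay at the top frequency.
    return max_khz
```

With a single dedicated vcpu per pcpu only the first branch is ever
taken, which is the case the current patchset covers.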

> Anyway, please repost with Cc to linux-pm so that we can restart the
> discussion there.
>
> Paolo

Done. Can you please reply with a concise summary of what you object to?