Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver

From: David Dai
Date: Thu Aug 24 2023 - 19:56:10 EST


On Sun, Aug 6, 2023 at 8:22 PM Pavan Kondeti <quic_pkondeti@xxxxxxxxxxx> wrote:
>
> On Fri, Aug 04, 2023 at 04:46:11PM -0700, David Dai wrote:
> > Hi Pavan,
> >
> > Thanks for reviewing!
> >
> > On Wed, Aug 2, 2023 at 9:18 PM Pavan Kondeti <quic_pkondeti@xxxxxxxxxxx> wrote:
> > >
> > > On Mon, Jul 31, 2023 at 10:46:09AM -0700, David Dai wrote:
> > > > Introduce a virtualized cpufreq driver for guest kernels to improve
> > > > performance and power of workloads within VMs.
> > > >
> > > > This driver does two main things:
> > > >
> > > > 1. Sends the frequency of vCPUs as a hint to the host. The host uses the
> > > > hint to schedule the vCPU threads and decide physical CPU frequency.
> > > >
> > > > 2. If a VM does not support a virtualized FIE(like AMUs), it queries the
> > > > host CPU frequency by reading a MMIO region of a virtual cpufreq device
> > > > to update the guest's frequency scaling factor periodically. This enables
> > > > accurate Per-Entity Load Tracking for tasks running in the guest.
> > > >
> > > > Co-developed-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> > > > Signed-off-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> > > > Signed-off-by: David Dai <davidai@xxxxxxxxxx>
> > >
> > > [...]
> > >
> > > > +static void virt_scale_freq_tick(void)
> > > > +{
> > > > + struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id());
> > > > + struct virt_cpufreq_drv_data *data = policy->driver_data;
> > > > + u32 max_freq = (u32)policy->cpuinfo.max_freq;
> > > > + u64 cur_freq;
> > > > + u64 scale;
> > > > +
> > > > + cpufreq_cpu_put(policy);
> > > > +
> > > > + cur_freq = (u64)data->ops->get_freq(policy);
> > > > + cur_freq <<= SCHED_CAPACITY_SHIFT;
> > > > + scale = div_u64(cur_freq, max_freq);
> > > > +
> > > > + this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > > > +}
> > > > +
> > >
> > > We expect the host to provide the frequency in kHz, can you please add a
> > > comment about it. It is not very obvious when you look at the
> > > REG_CUR_FREQ_OFFSET register name.
> >
> > I’ll include a KHZ in the offset names.
> >
>

Hi Pavan,

Apologies for the slow responses, I was out on vacation for the week
prior to last week.

> Sure, that would help. Also, can you limit the scale to
> SCHED_CAPACITY_SCALE? It may be possible that host may be running at a
> higher frequency than max_freq advertised on this guest.

Good catch, will include a check to limit freq_scale to SCHED_CAPACITY_SCALE.

>
> > >
> > > > +
> > > > +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > > > + unsigned int target_freq)
> > > > +{
> > > > + virt_cpufreq_set_perf(policy);
> > > > + return target_freq;
> > > > +}
> > > > +
> > > > +static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
> > > > + unsigned int index)
> > > > +{
> > > > + return virt_cpufreq_set_perf(policy);
> > > > +}
> > > > +
> > > > +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > > > +{
> > > > + struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data();
> > > > + struct cpufreq_frequency_table *table;
> > > > + struct device *cpu_dev;
> > > > + int ret;
> > > > +
> > > > + cpu_dev = get_cpu_device(policy->cpu);
> > > > + if (!cpu_dev)
> > > > + return -ENODEV;
> > > > +
> > > > + ret = dev_pm_opp_of_add_table(cpu_dev);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + ret = dev_pm_opp_get_opp_count(cpu_dev);
> > > > + if (ret <= 0) {
> > > > + dev_err(cpu_dev, "OPP table can't be empty\n");
> > > > + return -ENODEV;
> > > > + }
> > > > +
> > > > + ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
> > > > + if (ret) {
> > > > + dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
> > > > + return ret;
> > > > + }
> > > > +
> > > > + policy->freq_table = table;
> > > > + policy->dvfs_possible_from_any_cpu = false;
> > > > + policy->fast_switch_possible = true;
> > > > + policy->driver_data = drv_data;
> > > > +
> > > > + /*
> > > > + * Only takes effect if another FIE source such as AMUs
> > > > + * have not been registered.
> > > > + */
> > > > + topology_set_scale_freq_source(&virt_sfd, policy->cpus);
> > > > +
> > > > + return 0;
> > > > +
> > > > +}
> > > > +
> > >
> > > Do we need to register as FIE source even with the below commit? By
> > > registering as a source, we are not supplying any accurate metric. We
> > > still fallback on the same source that cpufreq implements it.
> >
> > The arch_set_freq_scale() done at cpufreq driver’s frequency updates
> > at cpufreq_freq_transition_end() and cpufreq_driver_fast_switch() only
> > represent the guest’s frequency request. However, this does not
> > accurately represent the physical CPU’s frequency that the vCPU is
> > running on. E.g. There may be other processes sharing the same
> > physical CPU that results in a much higher CPU frequency than what’s
> > requested by the vCPU.
> >
>
> understood that policy->cur may not reflect the actual frequency. Is this
> something needs to be advertised to cpufreq core so that it query the
> underlying cpufreq driver and use it for frequency scale updates. This
> also gives userspace to read the actual frequency when read from sysfs.
>

Adding a ->get() callback in the driver to advertise to the cpufreq
core does not actually update the freq_scale if fast_switch is
enabled. Since fast_switch is required for performance reasons, I
don’t see value in adding ->get().

> In fact, cpufreq_driver_fast_switch() comment says that
> cpufreq_driver::fast_switch() should return the *actual* frequency and
> the same is used to update frequency scale updates. I understand that it
> depends on other things like if host defer the frequency switch, the
> value read from REG_CUR_FREQ_OFFSET may reflect the old value..
>
> May be a comment in code would help.
>

Sounds good, I'll include a comment.

Thanks,
David

> Thanks,
> Pavan
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@xxxxxxxxxxx.
>